
PETSc.jl Ubuntu + DistributedStokesTests.jl issues #23

@amartinhuertas


I have found a clear lack of robustness in DistributedStokesTests.jl, at least in the state corresponding to commit 94671c9 (code available here), when it is combined with the PETSc binaries provided by Ubuntu on my system, namely:

libpetsc3.7-dev/bionic,now 3.7.7+dfsg1-2build5 amd64 [installed,automatic]
libpetsc3.7.7/bionic,now 3.7.7+dfsg1-2build5 amd64 [installed,automatic]
libpetsc3.7.7-dbg/bionic 3.7.7+dfsg1-2build5 amd64
libpetsc3.7.7-dev/bionic,now 3.7.7+dfsg1-2build5 amd64 [installed,automatic]

(I have not tested on other systems or with other versions of the Ubuntu PETSc package.)

The error manifests as a SEGFAULT within the second invocation of the run method here. You can see the error output below. It is also worth mentioning that if I comment out the two run() invocations in the middle, the program hangs (deadlocks) in the second run() invocation instead. When I hit Ctrl+C to abort the program, I get the following stack trace (presumably the point at which the program is stuck):

in expression starting at /home/amartin/git-repos/GridapDistributed.jl/test/DistributedStokesTests.jl:147
pthread_cond_signal at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
exec_blas_async at /home/amartin/software_installers/julia-1.4.0/bin/../lib/julia/libopenblas64_.so (unknown line)
exec_blas at /home/amartin/software_installers/julia-1.4.0/bin/../lib/julia/libopenblas64_.so (unknown line)
gemm_thread_m at /home/amartin/software_installers/julia-1.4.0/bin/../lib/julia/libopenblas64_.so (unknown line)
dtrsm_64_ at /home/amartin/software_installers/julia-1.4.0/bin/../lib/julia/libopenblas64_.so (unknown line)
umfdl_blas3_update at /home/amartin/software_installers/julia-1.4.0/bin/../lib/julia/libumfpack.so.5 (unknown line)
umfdl_kernel at /home/amartin/software_installers/julia-1.4.0/bin/../lib/julia/libumfpack.so.5 (unknown line)
umfpack_dl_numeric at /home/amartin/software_installers/julia-1.4.0/bin/../lib/julia/libumfpack.so.5 (unknown line)
umfpack_numeric! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/SuiteSparse/src/umfpack.jl:268
#lu#1 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/SuiteSparse/src/umfpack.jl:161
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/SuiteSparse/src/umfpack.jl:155 [inlined]
numerical_setup at /home/amartin/git-repos/Gridap.jl/src/Algebra/LinearSolvers.jl:237 [inlined]
solve! at /home/amartin/git-repos/Gridap.jl/src/Algebra/LinearSolvers.jl:184 [inlined]
solve! at /home/amartin/git-repos/Gridap.jl/src/Algebra/NonlinearSolvers.jl:22
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
solve! at /home/amartin/git-repos/Gridap.jl/src/FESpaces/FESolvers.jl:103
solve! at /home/amartin/git-repos/Gridap.jl/src/FESpaces/FESolvers.jl:15
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
solve at /home/amartin/git-repos/Gridap.jl/src/FESpaces/FESolvers.jl:37
solve at /home/amartin/git-repos/Gridap.jl/src/FESpaces/FESolvers.jl:43 [inlined]
run at /home/amartin/git-repos/GridapDistributed.jl/test/DistributedStokesTests.jl:122
unknown function (ip: 0x7effd26b78ab)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
#23 at /home/amartin/git-repos/GridapDistributed.jl/test/DistributedStokesTests.jl:151
SequentialCommunicator at /home/amartin/git-repos/GridapDistributed.jl/src/SequentialCommunicators.jl:8
unknown function (ip: 0x7effd26b68b2)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2158 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1692 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:369
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:458
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:409 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:817
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:911
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:814
jl_eval_module_expr at /buildworker/worker/package_linux64/build/src/toplevel.c:181
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:640
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:872
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:872
include at ./Base.jl:377
exec_options at ./client.jl:288
_start at ./client.jl:484
jfptr__start_30059 at /home/amartin/git-repos/Gridap.jl/compile/Gridapv0.12.0.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
unknown function (ip: (nil))
Allocations: 72926458 (Pool: 72918467; Big: 7991); GC: 85
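The innermost frames of this trace sit inside OpenBLAS's thread pool (exec_blas_async / pthread_cond_signal), reached from UMFPACK's numerical factorization through Julia's `lu`. One possible way to check whether OpenBLAS threading is involved is to pin Julia's BLAS to a single thread and repeat the factorization outside Gridap. The following is a minimal sketch under that assumption: the tridiagonal matrix is a hypothetical stand-in for the Stokes system, not the matrix assembled by the test, and the same `BLAS.set_num_threads(1)` call could equally be placed at the top of the test script itself.

```julia
using LinearAlgebra
using SparseArrays

# Pin Julia's bundled OpenBLAS to a single thread so that UMFPACK's dense
# BLAS calls (e.g. dtrsm) never go through the OpenBLAS thread pool.
# Exporting OPENBLAS_NUM_THREADS=1 before launching Julia is an alternative.
BLAS.set_num_threads(1)

# Hypothetical stand-in system (NOT the Stokes matrix from the test):
# a diagonally dominant tridiagonal sparse matrix.
n = 2_000
A = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.0, n - 1))
b = ones(n)

# Factorize twice, mimicking two consecutive run() calls.
for k in 1:2
    F = lu(A)          # dispatches to UMFPACK for SparseMatrixCSC{Float64}
    x = F \ b
    println("run $k: residual = ", norm(A * x - b))
end
```

If the real test stops hanging with a single BLAS thread, that would point at an interaction between OpenBLAS threads and the MPI/PETSc environment rather than at UMFPACK itself; this has not been verified.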

I have tried to find the cause of the problem, but I have not succeeded so far. It seems to happen in the numerical factorization stage within UMFPACK (umfpack_dl_numeric in the trace above). The surprising thing is that the issue does not seem to reproduce either with (1) my own build of PETSc 3.9.0 (which was not compiled with UMFPACK support) or (2) the PETSc binaries on Travis (which, as far as I know, are also compiled with UMFPACK support). One possible hypothesis (closer to speculation than to an actual fact) is that the UMFPACK library linked into PETSc might somehow conflict with the UMFPACK library bundled with Julia, but I do not have any evidence in this direction.
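To probe the UMFPACK-conflict hypothesis, one could list the shared libraries mapped into the Julia process after loading PETSc.jl and running the first solve. Julia's bundled copy (lib/julia/libumfpack.so.5 in the stack trace) and Ubuntu's SuiteSparse, which PETSc links via `-lumfpack` according to its configure options, likely share the soname libumfpack.so.5. On Linux the dynamic loader can reuse an already-loaded library with a matching soname, so seeing only one copy, or two unexpected ones, would be consistent with (but would not prove) a clash. A minimal sketch using Julia's `Libdl` stdlib; the helper name `suspicious_libs` and its search patterns are hypothetical.

```julia
using Libdl

# Hypothetical helper: return the paths of all shared libraries currently
# mapped into this process whose (lowercased) path contains any pattern.
# Run it in the same session, after `using PETSc` and the first solve,
# so the list reflects the libraries actually in use.
function suspicious_libs(patterns = ("umfpack", "blas", "petsc"))
    libs = Libdl.dllist()
    return filter(p -> any(occursin(pat, lowercase(p)) for pat in patterns), libs)
end

for lib in suspicious_libs()
    println(lib)
end
```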

1.3061638074924613e-30 < 1.0e-9
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.7, Sep, 25, 2017 
[0]PETSC ERROR: julia on a x86_64-linux-gnu-real named sistemas-ThinkPad-X1-Carbon-6th by amartin Fri Jul 17 18:41:22 2020
[0]PETSC ERROR: Configure options --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --with-silent-rules=0 --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --with-maintainer-mode=0 --with-dependency-tracking=0 --with-debugging=0 --shared-library-extension=_real --with-clanguage=C++ --with-shared-libraries --with-pic=1 --useThreads=0 --with-fortran-interfaces=1 --with-mpi-dir=/usr/lib/x86_64-linux-gnu/openmpi --with-blas-lib=-lblas --with-lapack-lib=-llapack --with-scalapack=1 --with-scalapack-lib=-lscalapack-openmpi --with-mumps=1 --with-mumps-include="[]" --with-mumps-lib="-ldmumps -lzmumps -lsmumps -lcmumps -lmumps_common -lpord" --with-suitesparse=1 --with-suitesparse-include=/usr/include/suitesparse --with-suitesparse-lib="-lumfpack -lamd -lcholmod -lklu" --with-spooles=1 --with-spooles-include=/usr/include/spooles --with-spooles-lib=-lspooles --with-ptscotch=1 --with-ptscotch-include=/usr/include/scotch --with-ptscotch-lib="-lptesmumps -lptscotch -lptscotcherr" --with-fftw=1 --with-fftw-include="[]" --with-fftw-lib="-lfftw3 -lfftw3_mpi" --with-superlu=1 --with-superlu-include=/usr/include/superlu --with-superlu-lib=-lsuperlu --with-hdf5=1 --with-hdf5-dir=/usr/lib/x86_64-linux-gnu/hdf5/openmpi --CXX_LINKER_FLAGS=-Wl,--no-as-needed --with-hypre=1 --with-hypre-include=/usr/include/hypre --with-hypre-lib="-lHYPRE_IJ_mv -lHYPRE_parcsr_ls -lHYPRE_sstruct_ls -lHYPRE_sstruct_mv -lHYPRE_struct_ls -lHYPRE_struct_mv -lHYPRE_utilities" --prefix=/usr/lib/petscdir/3.7.7/x86_64-linux-gnu-real PETSC_DIR=/build/petsc-vurd6G/petsc-3.7.7+dfsg1 --PETSC_ARCH=x86_64-linux-gnu-real CFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-vurd6G/petsc-3.7.7+dfsg1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC" CXXFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-vurd6G/petsc-3.7.7+dfsg1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC" FCFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-vurd6G/petsc-3.7.7+dfsg1=. -fstack-protector-strong -fPIC" FFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-vurd6G/petsc-3.7.7+dfsg1=. -fstack-protector-strong -fPIC" CPPFLAGS="-Wdate-time -D_FORTIFY_SOURCE=2" LDFLAGS="-Wl,-Bsymbolic-functions -Wl,-z,relro -fPIC" MAKEFLAGS=w
[0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
