Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HiOp segfaults on nlpDenseCons_ex1 tests when built with default options #1

Closed
goxberry opened this issue Jan 24, 2018 · 20 comments
Closed

Comments

@goxberry
Copy link
Member

Building HiOp with default options on macOS 10.12.6 and running the tests yields errors for the nlpDenseCons_ex1 tests. Log files can be found at: https://gist.github.com/goxberry/8bdc80e6dcd4d15ed0a7c5130009d6aa

The configuration I'm using is built by spack, so I have some flexibility in choosing libraries, but all of these libraries are included via RPATH directives. My impression is that linking isn't an issue, but I could be wrong about that.

@cnpetra
Copy link
Collaborator

cnpetra commented Jan 25, 2018

Thanks, will look into it once I find a Mac. Can you provide any info on LAPACK/BLAS u’re using? At first sight, it looks like the issue is related to that.

@goxberry
Copy link
Member Author

I think it's linking to the Accelerate framework right now. I'll try linking to a different LAPACK library.

@goxberry
Copy link
Member Author

I see BLAS calls in HiOp that look like, for instance, dgemv_, which assumes Fortran symbols are exported with trailing underscores. On my system, Fortran name-mangling uses leading and trailing underscores, so the BLAS library from reference LAPACK exports the symbol _dgemv_, and so does OpenBLAS:

snippet of nm libopenblas.a from OpenBLAS:

/Users/oxberry1/spack/opt/spack/darwin-sierra-x86_64/clang-8.1.0-apple/openblas-0.2.20-dxpaoysvipwyqcscj5n5j6t5cvqtldr2/lib/libopenblas.a(dgemv.o):
                   U ___assert_rtn
                   U ___stack_chk_fail
                   U ___stack_chk_guard
                   U _blas_memory_alloc
                   U _blas_memory_free
  0000000000000000 T _dgemv_
                   U _dgemv_n
                   U _dgemv_t
                   U _dscal_k
                   U _xerbla_
  0000000000000300 s l_dgemv_.gemv

snippet of nm libblas.a from reference LAPACK:

/Users/oxberry1/spack/opt/spack/darwin-sierra-x86_64/clang-8.1.0-apple/netlib-lapack-3.7.1-hxqwiepbfb3forawjin3gmj6rpb6cmxk/lib/libblas.a(dgemv.f.o):
 0000000000000780 s EH_frame1
 0000000000000000 T _dgemv_
                  U _lsame_
                  U _xerbla_
 00000000000006c3 s lC1
 00000000000006c4 s lC2
 00000000000006c5 s lC3
 00000000000006d0 s lC4
 00000000000006c6 s lC5

The compiler stack I've been using is Apple Clang 8.1 (from XCode 8.3) and gfortran 7.2.0 (from GCC 7.2.0). There are a bunch of ways around this issue, but none of them are quick:

  • on my side, I could force gfortran to export Fortran symbols without a leading underscore using -fno-leading-underscore, but that option means that other libraries will also expect LAPACK without leading underscores, so it might break other libraries

  • hypre uses C preprocessor macros to add leading and trailing underscores as needed, and then uses GNU Autotools to detect how Fortran symbols are exported

  • use CBLAS and LAPACKE, which is a standard part of LAPACK as of version 3.4.0

  • add a fake Fortran file and use CMake to detect the symbol mangling, which adds an unnecessary Fortran dependency to the project

For now, I'll stick to using HiOp on the clusters, because I know it works there. If I need to run HiOp on my laptop for some reason, I can consult with you and put together a patch, if you're interested.

@cnpetra
Copy link
Collaborator

cnpetra commented Jan 27, 2018

@junkudo has a fix for the fortran name mangling and will get in here soon...

@junkudo
Copy link

junkudo commented Jan 27, 2018

I can fix the fortran name mangling this weekend. :)

@jandrej
Copy link
Member

jandrej commented Feb 20, 2018

I get segfaults as well using clang5 on linux when I try to run the examples. With GCC everything works fine.

@cnpetra
Copy link
Collaborator

cnpetra commented Feb 21, 2018

Julian, thanks for reporting. Apparently, hiop has all kind of issues with clang.

clang5 means it's coming with llvm5, or its clang version 5.0 ?

I only have clang v3.4.2 on my linux box.

clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-final)
Target: x86_64-redhat-linux-gnu
Thread model: posix

@jandrej
Copy link
Member

jandrej commented Feb 21, 2018

It's clang version 5.0 from the llvm ubuntu repositories. Let me know if I can give further information which could help.

~$ clang -v
clang version 5.0.1-svn325091-1~exp1 (branches/release_50)
Target: x86_64-pc-linux-gnu
Thread model: posix

@jandrej
Copy link
Member

jandrej commented Feb 22, 2018

Just for your info. The disassembly states a UD2 instruction. This means clang recognized undefined behavior in the code. It's located in

bool hiopHessianLowRank::updateLogBarrierDiagonal(const hiopVector& Dx)

A run with -fsanitize=undefined returns

/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:135:10: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:43:28: note: nonnull attribute specified here
/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:135:16: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:28: note: nonnull attribute specified here
/home/juan/repos/hiop/src/Optimization/hiopKKTLinSys.cpp:539:27: runtime error: execution reached the end of a value-returning function without returning a value

@cnpetra
Copy link
Collaborator

cnpetra commented Sep 25, 2018

I see BLAS calls in HiOp that look like, for instance, dgemv_, which assumes Fortran symbols are exported with trailing underscores. On my system, Fortran name-mangling uses leading and trailing underscores, so the BLAS library from reference LAPACK exports the symbol _dgemv_, and so does OpenBLAS:

snippet of nm libopenblas.a from OpenBLAS:

/Users/oxberry1/spack/opt/spack/darwin-sierra-x86_64/clang-8.1.0-apple/openblas-0.2.20-dxpaoysvipwyqcscj5n5j6t5cvqtldr2/lib/libopenblas.a(dgemv.o):
                   U ___assert_rtn
                   U ___stack_chk_fail
                   U ___stack_chk_guard
                   U _blas_memory_alloc
                   U _blas_memory_free
  0000000000000000 T _dgemv_
                   U _dgemv_n
                   U _dgemv_t
                   U _dscal_k
                   U _xerbla_
  0000000000000300 s l_dgemv_.gemv

snippet of nm libblas.a from reference LAPACK:

/Users/oxberry1/spack/opt/spack/darwin-sierra-x86_64/clang-8.1.0-apple/netlib-lapack-3.7.1-hxqwiepbfb3forawjin3gmj6rpb6cmxk/lib/libblas.a(dgemv.f.o):
 0000000000000780 s EH_frame1
 0000000000000000 T _dgemv_
                  U _lsame_
                  U _xerbla_
 00000000000006c3 s lC1
 00000000000006c4 s lC2
 00000000000006c5 s lC3
 00000000000006d0 s lC4
 00000000000006c6 s lC5

The compiler stack I've been using is Apple Clang 8.1 (from XCode 8.3) and gfortran 7.2.0 (from GCC 7.2.0). There are a bunch of ways around this issue, but none of them are quick:

  • on my side, I could force gfortran to export Fortran symbols without a leading underscore using -fno-leading-underscore, but that option means that other libraries will also expect LAPACK without leading underscores, so it might break other libraries
  • hypre uses C preprocessor macros to add leading and trailing underscores as needed, and then uses GNU Autotools to detect how Fortran symbols are exported
  • use CBLAS and LAPACKE, which is a standard part of LAPACK as of version 3.4.0
  • add a fake Fortran file and use CMake to detect the symbol mangling, which adds an unnecessary Fortran dependency to the project

For now, I'll stick to using HiOp on the clusters, because I know it works there. If I need to run HiOp on my laptop for some reason, I can consult with you and put together a patch, if you're interested.

@goxberry : I've finally got my hands on a mac laptop and tested the solver. @junkudo 's fortran name mangling works like a champ. I've only fixed a couple of compilation warnings. Everything works fine. I've used clang + gfortran + blas from accelarate (thanks again for the instructions!) It would be awesome if you could give it try on your system.

@cnpetra
Copy link
Collaborator

cnpetra commented Sep 25, 2018

Just for your info. The disassembly states a UD2 instruction. This means clang recognized undefined behavior in the code. It's located in

bool hiopHessianLowRank::updateLogBarrierDiagonal(const hiopVector& Dx)

A run with -fsanitize=undefined returns

/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:135:10: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:43:28: note: nonnull attribute specified here
/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:135:16: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:28: note: nonnull attribute specified here
/home/juan/repos/hiop/src/Optimization/hiopKKTLinSys.cpp:539:27: runtime error: execution reached the end of a value-returning function without returning a value

@jandrej : tried -fsanitized and could not replicate your errors. Probably because those problems were fixed in the meanwhile -- I did a lot of valgrinding on linux on the library under many use cases within mfem recently and and fixed a couple of uninitialized memory accesses. It would be awesome if you can check again and see what happens. thanks

@jandrej
Copy link
Member

jandrej commented Sep 25, 2018

My cmake command is

CC=clang cmake -DCMAKE_CXX_FLAGS="-fsanitize=undefined" ../

which still produces the runtime error when running ex1

/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:137:56: runtime error: null pointer passed as argument 1, which is declared to never be null
/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:137:56: runtime error: null pointer passed as argument 2, which is declared to never be null

using the latest version of the repo.

@goxberry
Copy link
Member Author

Thank you both for patching this issue! I’m currently on travel, and will take a look after I return, probably no later than Monday.

@cnpetra
Copy link
Collaborator

cnpetra commented Sep 26, 2018

My cmake command is

CC=clang cmake -DCMAKE_CXX_FLAGS="-fsanitize=undefined" ../

which still produces the runtime error when running ex1

/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:137:56: runtime error: null pointer passed as argument 1, which is declared to never be null
/home/juan/repos/hiop/src/LinAlg/hiopMatrix.cpp:137:56: runtime error: null pointer passed as argument 2, which is declared to never be null

using the latest version of the repo.

thanks! I did get runtime errors with your cmake command, one coming from hiopMatrix.cpp:137:56, though they were all related to allocating an array of size 0. It may be that we're using different versions of clang (?). In any case, I've addressed all the errors I was seeing. Could you please pull the master and see what you're getting?

@jandrej
Copy link
Member

jandrej commented Sep 26, 2018

The example nlpDenseCons_ex1 is not crashing but still throws the "runtime error" from clang sanitize.

@cnpetra
Copy link
Collaborator

cnpetra commented Oct 1, 2018

could not replicate on my redhat machine. I've used clang 3.4.2 though.

@goxberry Could you please see if you get any errors with fsanitize when you're testing the fix for this issue?

Use something like

rm -rf *; CC=clang CXX=clang++ cmake -DCMAKE_CXX_FLAGS="-fsanitize=nullability,undefined,integer,alignment" -DHIOP_USE_MPI=ON -DHIOP_DEEPCHECKS=ON -DCMAKE_BUILD_TYPE=DEBUG ..; make -j4

and run ./src/Drivers/nlpDenseCons_ex1.exe

@goxberry
Copy link
Member Author

goxberry commented Oct 1, 2018

@cnpetra HiOp compiles and runs successfully with and without the -fsanitize=... flags on macOS 10.12.6 with OpenBLAS 0.3.3, gfortran 8.2.0, and AppleClang 9.0.0 (based on LLVM clang 4.0). I can't reproduce the behavior @jandrej is seeing. Maybe I need to build with a later version of LLVM's clang?

@cnpetra
Copy link
Collaborator

cnpetra commented Oct 4, 2018

yeah, it's strange we don't get the runtime error of @jandrej

looked at the code again, and, apparently passing null pointers to memcpy is not allowed even when the number of bytes to be copied is zero. I safequarded memcpy from null pointers in the latest commit.

@jandrej : is it too much to ask to try again? :)

@jandrej
Copy link
Member

jandrej commented Oct 4, 2018

The last commit seemed to clear the errors for clang! I don't see any warnings/errors anymore during the runs of the examples.

@cnpetra
Copy link
Collaborator

cnpetra commented Oct 4, 2018

great! closing the issue.

@cnpetra cnpetra closed this as completed Oct 4, 2018
cnpetra pushed a commit that referenced this issue Apr 14, 2020
cnpetra pushed a commit that referenced this issue May 12, 2020
Improvements to CMake configuration
ashermancinelli pushed a commit that referenced this issue Jun 22, 2021
Shri/scopflow major refactor

Closes #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, #11, #12, #14, #17, #18, and #15

See merge request exasgd/frameworks/scopflow!2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants