
Fix runTest segfault (remove cudaDeviceReset) and simplify googletest template usage #909

Merged
merged 84 commits into madgraph5:master on Jul 16, 2024

Conversation

valassi
Member

@valassi valassi commented Jul 12, 2024

This PR fixes #907, finally removing the issue that was blocking the implementation of two tests (#896). This also implied a simplification in the usage of googletest templates in our code. I made a successful proof of concept of adding the two tests, which I need for #896 in the work I am doing on master_june24 for channelids (previously this was blocked by bug #907).

@oliviermattelaer can you please review? Note, this PR sits on top of PR #908, which itself sits on top of #900 and #905. So I suggest reviewing and merging in this order

Thanks
Andrea

…n both CPU and GPU (prepare for madgraph5#896) - the C++ tests succeed but the CUDA tests segfaults madgraph5#903
…from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert
…ase-1.11.0

Revert "[gtest/june24] in CODEGEN cudacpp_test.mk, try to upgrade googletest from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert"
This reverts commit 34cd623.
…cc build in CUDA while debugging madgraph5#903

With testmisc.cc, valgrind gives a confusing error

==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2887713==
==2887713== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2887713==  Access not within mapped region at address 0x1FFE801FF8
==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2887713==    at 0x449C06: mg5amcGpu::constexpr_sin_quad(long double, bool) (constexpr_math.h:156)
==2887713==  If you believe this happened as a result of a stack
==2887713==  overflow in your program's main thread (unlikely but
==2887713==  possible), you can try to increase the size of the
==2887713==  main thread stack using the --main-stacksize= flag.
==2887713==  The main thread stack size used in this run was 8388608.
==2887713==
==2887713== HEAP SUMMARY:
==2887713==     in use at exit: 21,309,363 bytes in 13,995 blocks
==2887713==   total heap usage: 18,083 allocs, 4,088 frees, 51,971,780 bytes allocated
==2887713==
==2887713== LEAK SUMMARY:
==2887713==    definitely lost: 0 bytes in 0 blocks
==2887713==    indirectly lost: 0 bytes in 0 blocks
==2887713==      possibly lost: 2,599,608 bytes in 825 blocks
==2887713==    still reachable: 18,709,755 bytes in 13,170 blocks
==2887713==         suppressed: 0 bytes in 0 blocks
==2887713== Rerun with --leak-check=full to see details of leaked memory
==2887713==
==2887713== For lists of detected and suppressed errors, rerun with: -s
==2887713== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

Without testmisc.cc instead

[ RUN      ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==2889432== Invalid write of size 8
==2889432==    at 0x484E2DB: memmove (vg_replace_strmem.c:1385)
==2889432==    by 0x41A6EA: double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) (stl_algobase.h:431)
==2889432==    by 0x41A49B: double* std::__copy_move_a2<false, double*, double*>(double*, double*, double*) (stl_algobase.h:494)
==2889432==    by 0x41A1A5: double* std::__copy_move_a1<false, double*, double*>(double*, double*, double*) (stl_algobase.h:522)
==2889432==    by 0x419F4D: double* std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:529)
==2889432==    by 0x419D0C: double* std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:619)
==2889432==    by 0x419950: mg5amcGpu::CommonRandomNumberKernel::generateRnarray() (CommonRandomNumberKernel.cc:34)
==2889432==    by 0x44443D: CUDATest::prepareRandomNumbers(unsigned int) (runTest.cc:202)
==2889432==    by 0x440D98: MadgraphTest_CompareMomentaAndME_Test::TestBody() (MadgraphTest.h:253)
==2889432==    by 0x48790F: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==2889432==    by 0x480EF8: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==2889432==    by 0x459587: testing::Test::Run() (gtest.cc:2682)
==2889432==  Address 0x2fc0f200 is not stack'd, malloc'd or (recently) free'd
==2889432==
==2889432==
==2889432== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2889432==  Access not within mapped region at address 0x2FC0F200
==2889432==    at 0x484E2DB: memmove (vg_replace_strmem.c:1385)
...
Segmentation fault (core dumped)
…cc build while debugging madgraph5#903 also for C++

The test does not segfault without valgrind, but it does segfault in valgrind!
(NB this is all related to debug builds, in C++ and in CUDA)

And with testmisc.cc, valgrind gives a confusing error for C++ (cppnone here) as in CUDA:

==2893804== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2893804==  Access not within mapped region at address 0x1FFE801FF8
==2893804== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2893804==    at 0x431835: mg5amcCpu::constexpr_sin_quad(long double, bool) (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/runTest_cpp.exe)

So I disabled testmisc, but now the C++ test (cppnone here) no longer segfaults...?!
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
Revert "[gtest/june24] in gg_tt.mad cudacpp.mk, TEMPORARELY disable testmisc.cc build while debugging madgraph5#903 also for C++"
This reverts commit 944caab.

Will now test with clang16 (after recent fixes) and valgrind (after upgrading to 3.23)
…ster for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ster for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…ODEGEN logs from the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…g constexpr_sin: now valgrind on c++ runTest succeeds again?!

However cuda still fails (even without valgrind) madgraph5#903
… now valgrind runTest_cpp.exe will fail

Revert "[gtest/june24] in gg_tt.mad testmisc.cc, comment out the section using constexpr_sin: now valgrind on c++ runTest succeeds again?!"
This reverts commit 975f7aacb8661807a329ec1f51b2d7d8dba45167.
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…et() at the end, but an abort reappears

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (194 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (194 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (174 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (174 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (384 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
…st.cc to the main in testxxx.cc, but an abort reappears

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (180 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (180 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (395 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
… to the atexit function, but this STILL crashes! madgraph5#907

WILL THEREFORE COMMENT OUT THIS CALL...

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (179 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (179 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (393 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
… to avoid all crashes madgraph5#907 (FIXME? avoid cuda api calls in dtors?)

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (199 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (199 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (181 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (181 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (396 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
INFO: No Floating Point Exceptions have been reported
…ting::Test argument to the compareME function, to allow the use of HasFailure

This essentially COMPLETES the fixes for madgraph5#907 and preparatory work for madgraph5#896
…pare to comment out test2 (preparatory work for madgraph5#896)

All tests succeed on cuda and all simd
…ry work for madgraph5#896)

All tests succeed on cuda and all simd - will backport to CODEGEN now
…st.cc, testxxx.cc: simplify gtest templates, remove cudaDeviceReset to fix madgraph5#907, complete preparation of two-test infrastructure madgraph5#896

More in detail:
- move to the simplest "TEST(" use case of Google tests in MadgraphTest.h and runTest.cc (remove unnecessary levels of templating)
- move gpuDeviceReset() to an atexit function of main in testxxx and comment it out anyway, to fix the segfaults madgraph5#907
  (eventually it may be necessary to remove all CUDA API calls from destructors, if ever we need to put this back in)
- in runTest.cc, complete a proof of concept for adding two separate tests (without/with multichannel madgraph5#896)

Fix some clang formatting issues with respect to the last gg_tt.mad
… the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
Member

@oliviermattelaer oliviermattelaer left a comment


Hi Andrea,

Thanks for this, this sounds good and I see no issues.
Obviously, this is another PR based on #905, which means that the status of this one relies on it (like #908). As in #908, I approve this PR, but we need to agree on what to do for #905 first (which should be quite easy).

Thanks,

Olivier

…r if OpenMP builds are attempted on clang16/17 (as discussed with Olivier in madgraph5#905)
…s from the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…aster with OMP madgraph5#900 and submod madgraph5#897) into gtest

Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
  git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt

Note: MG5AMC has been updated including mg5amcnlo#107
…s from the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…aster with clang madgraph5#905, OMP madgraph5#900 and submod madgraph5#897) into gtest

Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
  git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt

Note: MG5AMC has been updated including mg5amcnlo#107
…ogs from the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
@valassi
Member Author

valassi commented Jul 16, 2024

Hi Andrea,

Thanks for this, this sounds good and I see no issues. Obviously, this is another PR based on #905, which means that the status of this one relies on it (like #908). As in #908, I approve this PR, but we need to agree on what to do for #905 first (which should be quite easy).

Thanks,

Olivier

Thanks Olivier :-)

I again updated this and regenerated as a check. Will run the CI then merge.

Andrea

@valassi
Member Author

valassi commented Jul 16, 2024

The CI completed with 163 successful and 6 failing checks. This is as expected.

Merging now

@valassi valassi merged commit 606ee3b into madgraph5:master Jul 16, 2024
163 of 169 checks passed
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 17, 2024
…ion, removing the attempts to add two tests madgraph5#896

My last commit was showing the segfault issue madgraph5#907 solved in upcoming PR madgraph5#909 (and bits of madgraph5#908).
I will cherry pick the CODEGEN from madgraph5#909 (and madgraph5#908) first and try again.

git checkout 3eb4c29 gg_tt.mad/SubProcesses/runTest.cc
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 17, 2024
…ng PR madgraph5#905, constexpr_math.h PR madgraph5#908 and runTest/cudaDeviceReset PR madgraph5#909

Add valgrind.h and its symlink in the repo for gg_tt.mad

The new runTest.cc template now has a (commented out) proof of concept for including two tests (with/without multichannel, madgraph5#896); I will resume from there

After building bldall, the following succeeds
for bck in none sse4 avx2 512y 512z cuda; do echo $bck; ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done

This instead is crashing (again?) for some AVX values
for bck in none sse4 avx2 512y 512z cuda; do echo $bck; valgrind ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done
On closer inspection, this is because valgrind does not support AVX512, so this is ok
Successfully merging this pull request may close these issues.

segfault in CommonRandomNumberKernel for cuda when adding a second gtest