segfault in CommonRandomNumberKernel for cuda when adding a second gtest #907

Closed
valassi opened this issue Jul 12, 2024 · 3 comments · Fixed by #909

@valassi
Member

valassi commented Jul 12, 2024

I repeat here what I had initially filed in #903. Unfortunately, that report mixed two separate issues:

  • bugs in constexpr_math.h (#903, "segfault in constexpr_sin_quad when running runTest.exe through valgrind"), now fixed, which caused endless recursion, stack overflows and segfaults (in C++ and CUDA, especially or only through valgrind); this happened with upstream/master code with a single gtest
  • a segfault that seems to come from CommonRandomNumberKernel, only in CUDA, and only when adding a second gtest

This bug report is about the latter issue, which is a blocker for the way I am implementing #896 (two tests for channelids).


I am working on adding tests for two warps with different channels (#896).

The working idea is to have two tests in runTest.cc (one with and one without channelids).

As a preliminary check of the infrastructure, I duplicated the existing no-multichannel test. This gives an unexpected segfault in a surprising place.

The code change is:

diff --git a/epochX/cudacpp/gg_tt.mad/SubProcesses/runTest.cc b/epochX/cudacpp/gg_tt.mad/SubProcesses/runTest.cc
index 3cc650d66..2adddc2ab 100644
--- a/epochX/cudacpp/gg_tt.mad/SubProcesses/runTest.cc
+++ b/epochX/cudacpp/gg_tt.mad/SubProcesses/runTest.cc
@@ -245,21 +245,35 @@ struct CUDATest : public CUDA_CPU_TestBase
 // Use two levels of macros to force stringification at the right level
 // (see https://gcc.gnu.org/onlinedocs/gcc-3.0.1/cpp_3.html#SEC17 and https://stackoverflow.com/a/3419392)
 // Google macro is in https://github.com/google/googletest/blob/master/googletest/include/gtest/gtest-param-test.h
-#define TESTID_CPU( s ) s##_CPU
-#define XTESTID_CPU( s ) TESTID_CPU( s )
-#define MG_INSTANTIATE_TEST_SUITE_CPU( prefix, test_suite_name ) \
+#define TESTID_CPU1( s ) s##_CPU1
+#define XTESTID_CPU1( s ) TESTID_CPU1( s )
+#define MG_INSTANTIATE_TEST_SUITE_CPU1( prefix, test_suite_name ) \
 INSTANTIATE_TEST_SUITE_P( prefix, \
                           test_suite_name, \
                           testing::Values( new CPUTest( MG_EPOCH_REFERENCE_FILE_NAME ) ) );
-#define TESTID_GPU( s ) s##_GPU
-#define XTESTID_GPU( s ) TESTID_GPU( s )
-#define MG_INSTANTIATE_TEST_SUITE_GPU( prefix, test_suite_name ) \
+#define TESTID_CPU2( s ) s##_CPU2
+#define XTESTID_CPU2( s ) TESTID_CPU2( s )
+#define MG_INSTANTIATE_TEST_SUITE_CPU2( prefix, test_suite_name ) \
+INSTANTIATE_TEST_SUITE_P( prefix, \
+                          test_suite_name, \
+                          testing::Values( new CPUTest( MG_EPOCH_REFERENCE_FILE_NAME ) ) );
+#define TESTID_GPU1( s ) s##_GPU1
+#define XTESTID_GPU1( s ) TESTID_GPU1( s )
+#define MG_INSTANTIATE_TEST_SUITE_GPU1( prefix, test_suite_name ) \
+INSTANTIATE_TEST_SUITE_P( prefix, \
+                          test_suite_name, \
+                          testing::Values( new CUDATest( MG_EPOCH_REFERENCE_FILE_NAME ) ) );
+#define TESTID_GPU2( s ) s##_GPU2
+#define XTESTID_GPU2( s ) TESTID_GPU2( s )
+#define MG_INSTANTIATE_TEST_SUITE_GPU2( prefix, test_suite_name ) \
 INSTANTIATE_TEST_SUITE_P( prefix, \
                           test_suite_name, \
                           testing::Values( new CUDATest( MG_EPOCH_REFERENCE_FILE_NAME ) ) );
 
 #ifdef MGONGPUCPP_GPUIMPL
-MG_INSTANTIATE_TEST_SUITE_GPU( XTESTID_GPU( MG_EPOCH_PROCESS_ID ), MadgraphTest );
+MG_INSTANTIATE_TEST_SUITE_GPU1( XTESTID_GPU1( MG_EPOCH_PROCESS_ID ), MadgraphTest );
+MG_INSTANTIATE_TEST_SUITE_GPU2( XTESTID_GPU2( MG_EPOCH_PROCESS_ID ), MadgraphTest );
 #else
-MG_INSTANTIATE_TEST_SUITE_CPU( XTESTID_CPU( MG_EPOCH_PROCESS_ID ), MadgraphTest );
+MG_INSTANTIATE_TEST_SUITE_CPU1( XTESTID_CPU1( MG_EPOCH_PROCESS_ID ), MadgraphTest );
+MG_INSTANTIATE_TEST_SUITE_CPU2( XTESTID_CPU2( MG_EPOCH_PROCESS_ID ), MadgraphTest );
 #endif /* clang-format on */
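
The two-level macros in this diff rely on a standard preprocessor rule: an argument that is an operand of `##` is pasted without being expanded first, so a wrapper macro is needed to expand `MG_EPOCH_PROCESS_ID` before the paste. The stand-alone sketch below only illustrates that rule. `PROCESS_ID`, `STR` and `XSTR` are hypothetical names introduced for the illustration, and the value `SIGMA_SM_GG_TTX` is assumed from the suite names printed in the test output.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for MG_EPOCH_PROCESS_ID (value assumed for illustration)
#define PROCESS_ID SIGMA_SM_GG_TTX

// Same shape as the macros in the diff
#define TESTID_GPU1( s ) s##_GPU1
#define XTESTID_GPU1( s ) TESTID_GPU1( s )

// Hypothetical helpers that turn the resulting token into a string
#define STR( s ) #s
#define XSTR( s ) STR( s )

// One level: PROCESS_ID is an operand of ## and is pasted unexpanded
inline std::string oneLevelName() { return XSTR( TESTID_GPU1( PROCESS_ID ) ); }

// Two levels: XTESTID_GPU1 expands PROCESS_ID before TESTID_GPU1 pastes it
inline std::string twoLevelName() { return XSTR( XTESTID_GPU1( PROCESS_ID ) ); }
```

Here `oneLevelName()` returns "PROCESS_ID_GPU1" while `twoLevelName()` returns "SIGMA_SM_GG_TTX_GPU1", which is why the instantiations go through the `XTESTID_*` wrappers.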

The segfault is

Thread 1 "runTest_cuda.ex" received signal SIGSEGV, Segmentation fault.
0x00007ffff58ca9ad in __memmove_evex_unaligned_erms () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64 nvidia-driver-cuda-libs-530.30.02-1.el9.x86_64
(gdb) where
#0  0x00007ffff58ca9ad in __memmove_evex_unaligned_erms () from /lib64/libc.so.6
#1  0x000000000041a72b in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double> (__first=0x1061020, __last=0x1065020, 
    __result=0x7fffcdc0f200) at /usr/include/c++/11/bits/stl_algobase.h:431
#2  0x000000000041a4dc in std::__copy_move_a2<false, double*, double*> (__first=0x1061020, __last=0x1065020, __result=0x7fffcdc0f200)
    at /usr/include/c++/11/bits/stl_algobase.h:494
#3  0x000000000041a1e6 in std::__copy_move_a1<false, double*, double*> (__first=0x1061020, __last=0x1065020, __result=0x7fffcdc0f200)
    at /usr/include/c++/11/bits/stl_algobase.h:522
#4  0x0000000000419f8e in std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>
    (__first=0.68838450471256951, __last=0, __result=0x7fffcdc0f200) at /usr/include/c++/11/bits/stl_algobase.h:529
#5  0x0000000000419d4d in std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*> (
    __first=0.68838450471256951, __last=0, __result=0x7fffcdc0f200) at /usr/include/c++/11/bits/stl_algobase.h:619
#6  0x0000000000419991 in mg5amcGpu::CommonRandomNumberKernel::generateRnarray (this=0x7fffffffccb0)
    at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CommonRandomNumberKernel.cc:34
#7  0x000000000044e7c8 in CUDATest::prepareRandomNumbers (this=0x1453fb0, iiter=0)
    at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/runTest.cc:202
#8  0x000000000044b12f in MadgraphTest_CompareMomentaAndME_Test::TestBody (this=0xa234a0)
    at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/MadgraphTest.h:253
#9  0x00000000004919d2 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0xa234a0, 
    method=&virtual testing::Test::TestBody(), location=0x51ab73 "the test body")
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:2607
#10 0x000000000048b097 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0xa234a0, 
    method=&virtual testing::Test::TestBody(), location=0x51ab73 "the test body")
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:2643
#11 0x00000000004637c8 in testing::Test::Run (this=0xa234a0) at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:2682
#12 0x0000000000464119 in testing::TestInfo::Run (this=0xcde490)
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:2861
#13 0x0000000000464929 in testing::TestSuite::Run (this=0xbc5da0)
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:3015
#14 0x0000000000473e82 in testing::internal::UnitTestImpl::RunAllTests (this=0x635790)
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:5855
#15 0x0000000000492b23 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x635790, 
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x473abe <testing::internal::UnitTestImpl::RunAllTests()>, location=0x51b5d8 "auxiliary test code (environments or event listeners)")
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:2607
#16 0x000000000048c00f in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x635790, 
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x473abe <testing::internal::UnitTestImpl::RunAllTests()>, location=0x51b5d8 "auxiliary test code (environments or event listeners)")
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:2643
#17 0x0000000000472828 in testing::UnitTest::Run (this=0x620220 <testing::UnitTest::GetInstance()::instance>)
    at /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest.cc:5438
#18 0x000000000043b078 in RUN_ALL_TESTS () at ../../../../../test/googletest/install_gcc11.3.1/include/gtest/gtest.h:2490
#19 0x000000000043a984 in main (argc=1, argv=0x7fffffffdb48)
    at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/testxxx.cc:439

NB: after the fixes for #903 in constexpr_math.h, this only happens in the CUDA build, while the same test in C++ is ok.

@valassi
Member Author

valassi commented Jul 12, 2024

After cleaning up everything else, this is the remaining failure (debug build, run through valgrind):

make BACKEND=cuda -f cudacpp.mk -j debug

valgrind ./runTest_cuda.exe 
...
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==3242005== Warning: set address range perms: large range [0x59c97000, 0xa1c96000) (noaccess)
==3242005== Warning: set address range perms: large range [0x5a000000, 0xa0000000) (noaccess)
==3242005== Warning: set address range perms: large range [0x59c97000, 0xa1c96000) (noaccess)
INFO: No Floating Point Exceptions have been reported
==3242005== Warning: set address range perms: large range [0x5a000000, 0xa0000000) (noaccess)
[       OK ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0 (1909 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest (1909 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==3242005== Invalid write of size 8
==3242005==    at 0x485167B: memmove (vg_replace_strmem.c:1414)
==3242005==    by 0x41A72A: double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) (stl_algobase.h:431)
==3242005==    by 0x41A4DB: double* std::__copy_move_a2<false, double*, double*>(double*, double*, double*) (stl_algobase.h:494)
==3242005==    by 0x41A1E5: double* std::__copy_move_a1<false, double*, double*>(double*, double*, double*) (stl_algobase.h:522)
==3242005==    by 0x419F8D: double* std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:529)
==3242005==    by 0x419D4C: double* std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:619)
==3242005==    by 0x419990: mg5amcGpu::CommonRandomNumberKernel::generateRnarray() (CommonRandomNumberKernel.cc:34)
==3242005==    by 0x450287: CUDATest::prepareRandomNumbers(unsigned int) (runTest.cc:202)
==3242005==    by 0x44CCB2: MadgraphTest_CompareMomentaAndME_Test::TestBody() (MadgraphTest.h:253)
==3242005==    by 0x493147: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==3242005==    by 0x48C80C: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==3242005==    by 0x464F3D: testing::Test::Run() (gtest.cc:2682)
==3242005==  Address 0x2fc0f200 is not stack'd, malloc'd or (recently) free'd
==3242005== 
==3242005== 
==3242005== Process terminating with default action of signal 11 (SIGSEGV)
==3242005==  Access not within mapped region at address 0x2FC0F200

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…licate test

Revert "[gtest/june24] Remove the duplicate test... the segfault in c++ through valgrind is still there! madgraph5#903"
This reverts commit 9326c2b.

Now the segfaults madgraph5#903 and issue madgraph5#906 have disappeared, but the segfault madgraph5#907 in cuda has reappeared...

make BACKEND=cuda -f cudacpp.mk -j debug

[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==3242005== Warning: set address range perms: large range [0x59c97000, 0xa1c96000) (noaccess)
==3242005== Warning: set address range perms: large range [0x5a000000, 0xa0000000) (noaccess)
==3242005== Warning: set address range perms: large range [0x59c97000, 0xa1c96000) (noaccess)
INFO: No Floating Point Exceptions have been reported
==3242005== Warning: set address range perms: large range [0x5a000000, 0xa0000000) (noaccess)
[       OK ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0 (1909 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest (1909 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==3242005== Invalid write of size 8
==3242005==    at 0x485167B: memmove (vg_replace_strmem.c:1414)
==3242005==    by 0x41A72A: double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) (stl_algobase.h:431)
==3242005==    by 0x41A4DB: double* std::__copy_move_a2<false, double*, double*>(double*, double*, double*) (stl_algobase.h:494)
==3242005==    by 0x41A1E5: double* std::__copy_move_a1<false, double*, double*>(double*, double*, double*) (stl_algobase.h:522)
==3242005==    by 0x419F8D: double* std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:529)
==3242005==    by 0x419D4C: double* std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:619)
==3242005==    by 0x419990: mg5amcGpu::CommonRandomNumberKernel::generateRnarray() (CommonRandomNumberKernel.cc:34)
==3242005==    by 0x450287: CUDATest::prepareRandomNumbers(unsigned int) (runTest.cc:202)
==3242005==    by 0x44CCB2: MadgraphTest_CompareMomentaAndME_Test::TestBody() (MadgraphTest.h:253)
==3242005==    by 0x493147: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==3242005==    by 0x48C80C: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==3242005==    by 0x464F3D: testing::Test::Run() (gtest.cc:2682)
==3242005==  Address 0x2fc0f200 is not stack'd, malloc'd or (recently) free'd
==3242005==
==3242005==
==3242005== Process terminating with default action of signal 11 (SIGSEGV)
==3242005==  Access not within mapped region at address 0x2FC0F200
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…tempts to duplicate tests using the current infrastructure

The code is now identical to the current gtest branch for PR madgraph5#908.
Will instead solve madgraph5#907 by using a much simpler test infrastructure with fewer template levels.

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, try a different way to duplicate the tests, it still segfaults - will revert"
This reverts commit 24e00a2.

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, add debug printout for test ctor/dtor"
This reverts commit 0700a85.

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, temporarely add back the duplicate test"
This reverts commit 0cce7fb.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…to remove unnecessary levels of templating from GoogleTest.

With the new implementation, successfully duplicate the madgraph comparison test in both CPU and CUDA, avoiding segfaults madgraph5#907

There are only two issues left to solve:
- First, I had to comment out some "HasFailures" which otherwise was preventing compilation
- Second, the gpuDeviceReset must be moved elsewhere, in a single place for both tests

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (180 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (180 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (395 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…tests and add back the duplication of tests - the segfault madgraph5#907 disappears if gpuDeviceReset is removed!

git revert b6dcf4e
git cherry-pick 0cce7fb

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
INFO: No Floating Point Exceptions have been reported
[       OK ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0 (200 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest (200 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
INFO: No Floating Point Exceptions have been reported
[       OK ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0 (181 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest (181 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (398 ms total)
[  PASSED  ] 4 tests.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…he segfault does reappear! THIS PROVES THAT GPUDEVICERESET CAUSES madgraph5#907

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, comment out gpuDeviceReset(), this makes the previous problems disappear (but there is a leak...)"
This reverts commit 503a69d.

(The code is now identical to 0cce7fb)

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
INFO: No Floating Point Exceptions have been reported
[       OK ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0 (226 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest (226 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
Segmentation fault (core dumped)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…he simpler gtest infrastructure, must fix the leak by calling reset>

(The code is now identical to 503a69d)

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, add back gpuDeviceReset(), the segfault does reappear! THIS PROVES THAT GPUDEVICERESET CAUSES madgraph5#907"
This reverts commit a914fa2.

Revert "[gtest2/june24] in gg_tt.mad, temporarely revert the simplication of tests and add back the duplication of tests - the segfault madgraph5#907 disappears if gpuDeviceReset is removed!"
This reverts commit d154507.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
… to the atexit function, but this STILL crashes! madgraph5#907

WILL THEREFORE COMMENT OUT THIS CALL...

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (179 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (179 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (393 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
… to avoid all crashes madgraph5#907 (FIXME? avoid cuda api calls in dtors?)

INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (199 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (199 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (181 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (181 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (396 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
INFO: No Floating Point Exceptions have been reported
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…ting::Test argument to the compareME function, to allow the use of HasFailure

This essentially COMPLETES the fixes for madgraph5#907 and preparatory work for madgraph5#896
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 12, 2024
…st.cc, testxxx.cc: simplify gtest templates, remove cudaDeviceReset to fix madgraph5#907, complete preparation of two-test infrastructure madgraph5#896

More in detail:
- move to the simplest "TEST(" use case of Google tests in MadgraphTest.h and runTest.cc (remove unnecessary levels of templating)
- move gpuDeviceReset() to an atexit function of main in testxxx and comment it out anyway, to fix the segfaults madgraph5#907
  (eventually it may be necessary to remove all CUDA API calls from destructors, if ever we need to put this back in)
- in runTest.cc, complete a proof of concept for adding two separate tests (without/with multichannel madgraph5#896)

Fix some clang formatting issues with respect to the last gg_tt.mad
@valassi
Member Author

valassi commented Jul 12, 2024

This is finally understood and I have prepared a workaround in an upcoming PR.

The problem turned out to be that cudaDeviceReset() may lead to destructors being called twice. In particular, our Google tests were structured so that, with two tests, the first test goes out of scope and calls cudaDeviceReset() while the second test still needs to execute, and this segfaults. See for instance https://stackoverflow.com/a/14610501 and https://stackoverflow.com/a/16982503.
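
To make the failure mode concrete, here is a toy model in plain C++ with no CUDA involved. It is only a sketch under assumptions: `FakeDevice`, `FakeTest` and `bothTestsPass` are hypothetical names, the "device" is just a set of live handles, and I assume that the second test already owns a buffer when the first test is torn down.

```cpp
#include <cassert>
#include <memory>
#include <set>

// Toy stand-in for the GPU context (hypothetical, no CUDA involved)
class FakeDevice
{
public:
  int allocate() { const int handle = m_next++; m_live.insert( handle ); return handle; }
  void release( int handle ) { m_live.erase( handle ); }
  void reset() { m_live.clear(); } // models cudaDeviceReset(): every allocation becomes invalid
  bool isValid( int handle ) const { return m_live.count( handle ) > 0; }
private:
  int m_next = 0;
  std::set<int> m_live;
};

// Toy stand-in for a test driver that owns one device buffer
class FakeTest
{
public:
  FakeTest( FakeDevice& device, bool resetInDtor )
    : m_device( device ), m_resetInDtor( resetInDtor ), m_handle( device.allocate() ) {}
  ~FakeTest()
  {
    if( m_resetInDtor ) m_device.reset(); // the problematic pattern
    else m_device.release( m_handle ); // only give back what this test owns
  }
  // The real code would segfault on an invalid buffer; the toy just reports it
  bool run() const { return m_device.isValid( m_handle ); }
private:
  FakeDevice& m_device;
  bool m_resetInDtor;
  int m_handle;
};

// Create two tests up front, run and destroy the first, then run the second
inline bool bothTestsPass( bool resetInDtor )
{
  FakeDevice device;
  auto test1 = std::make_unique<FakeTest>( device, resetInDtor );
  auto test2 = std::make_unique<FakeTest>( device, resetInDtor );
  const bool ok1 = test1->run();
  test1.reset(); // the first test goes out of scope here
  const bool ok2 = test2->run();
  return ok1 && ok2;
}
```

In this model `bothTestsPass( true )` is false, because the reset in the first test's destructor invalidates the second test's buffer, while `bothTestsPass( false )` is true.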

I commented out the call as a workaround. A more complete fix might require a more significant rewrite of the CUDA classes to avoid CUDA API calls in destructors, but there is no guarantee that it would work. In addition, it does not really seem to be needed: I tested runTest.exe through the CUDA compute-sanitizer and it reports no leaks even if cudaDeviceReset() is not called.
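
If the reset ever needs to come back, one possible direction (not what the workaround does, and untested against the real classes) would be to tie it to a single guard object that is constructed before any buffer, so that it is destroyed after all of them. The sketch below only demonstrates the C++ rule this would rely on, namely that objects in a scope are destroyed in reverse order of construction. `ResetGuard`, `Buffer`, `EventLog` and `teardownOrder` are hypothetical names, and the destructors record events in a log instead of calling any CUDA API.

```cpp
#include <cassert>
#include <string>
#include <vector>

using EventLog = std::vector<std::string>;

// Hypothetical guard: its destructor is where a single device reset would go
class ResetGuard
{
public:
  explicit ResetGuard( EventLog& log ) : m_log( log ) {}
  ~ResetGuard() { m_log.push_back( "reset" ); }
private:
  EventLog& m_log;
};

// Hypothetical buffer: its destructor is where a device free would go
class Buffer
{
public:
  Buffer( EventLog& log, const std::string& name ) : m_log( log ), m_name( name ) {}
  ~Buffer() { m_log.push_back( "free " + m_name ); }
private:
  EventLog& m_log;
  std::string m_name;
};

// Record the order in which the destructors run when the inner scope ends
inline EventLog teardownOrder()
{
  EventLog log;
  {
    ResetGuard guard( log ); // constructed first, so destroyed last
    Buffer buffer1( log, "buffer1" );
    Buffer buffer2( log, "buffer2" );
  }
  return log;
}
```

The recorded order is "free buffer2", "free buffer1", "reset": the reset comes only after every buffer has been released, which is the ordering that a reset inside a per-test destructor cannot guarantee.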

In the upcoming PR I also modified the googletest infrastructure, simplifying the use of templates, to make it easier to understand when classes are created and by whom. The functionality is totally equivalent and I find the code easier to navigate now.

@valassi
Member Author

valassi commented Jul 12, 2024

This will be fixed by merging #909. Closing.

@valassi valassi closed this as completed Jul 12, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
…ion, removing the attempts to add two tests madgraph5#896

My last commit was showing the segfault issue madgraph5#907 solved in upcoming PR madgraph5#909 (and bits of madgraph5#908).
I will cherry pick the CODEGEN from madgraph5#909 (and madgraph5#908) first and try again.

git checkout 3eb4c29 gg_tt.mad/SubProcesses/runTest.cc
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
…st.cc, testxxx.cc: simplify gtest templates, remove cudaDeviceReset to fix madgraph5#907, complete preparation of two-test infrastructure madgraph5#896

More in detail:
- move to the simplest "TEST(" use case of Google tests in MadgraphTest.h and runTest.cc (remove unnecessary levels of templating)
- move gpuDeviceReset() to an atexit function of main in testxxx and comment it out anyway, to fix the segfaults madgraph5#907
  (eventually it may be necessary to remove all CUDA API calls from destructors, if ever we need to put this back in)
- in runTest.cc, complete a proof of concept for adding two separate tests (without/with multichannel madgraph5#896)

Fix some clang formatting issues with respect to the last gg_tt.mad