segfault in CommonRandomNumberKernel for cuda when adding a second gtest #907
After cleaning all the rest, this is it
…licate test
Revert "[gtest/june24] Remove the duplicate test... the segfault in c++ through valgrind is still there! madgraph5#903"
This reverts commit 9326c2b.
Now the segfaults madgraph5#903 and issue madgraph5#906 have disappeared, but the segfault madgraph5#907 in cuda has reappeared...
make BACKEND=cuda -f cudacpp.mk -j debug
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==3242005== Warning: set address range perms: large range [0x59c97000, 0xa1c96000) (noaccess)
==3242005== Warning: set address range perms: large range [0x5a000000, 0xa0000000) (noaccess)
==3242005== Warning: set address range perms: large range [0x59c97000, 0xa1c96000) (noaccess)
INFO: No Floating Point Exceptions have been reported
==3242005== Warning: set address range perms: large range [0x5a000000, 0xa0000000) (noaccess)
[ OK ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0 (1909 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest (1909 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==3242005== Invalid write of size 8
==3242005== at 0x485167B: memmove (vg_replace_strmem.c:1414)
==3242005== by 0x41A72A: double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) (stl_algobase.h:431)
==3242005== by 0x41A4DB: double* std::__copy_move_a2<false, double*, double*>(double*, double*, double*) (stl_algobase.h:494)
==3242005== by 0x41A1E5: double* std::__copy_move_a1<false, double*, double*>(double*, double*, double*) (stl_algobase.h:522)
==3242005== by 0x419F8D: double* std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:529)
==3242005== by 0x419D4C: double* std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:619)
==3242005== by 0x419990: mg5amcGpu::CommonRandomNumberKernel::generateRnarray() (CommonRandomNumberKernel.cc:34)
==3242005== by 0x450287: CUDATest::prepareRandomNumbers(unsigned int) (runTest.cc:202)
==3242005== by 0x44CCB2: MadgraphTest_CompareMomentaAndME_Test::TestBody() (MadgraphTest.h:253)
==3242005== by 0x493147: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==3242005== by 0x48C80C: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==3242005== by 0x464F3D: testing::Test::Run() (gtest.cc:2682)
==3242005== Address 0x2fc0f200 is not stack'd, malloc'd or (recently) free'd
==3242005==
==3242005==
==3242005== Process terminating with default action of signal 11 (SIGSEGV)
==3242005== Access not within mapped region at address 0x2FC0F200
…tempts to duplicate tests using the current infrastructure
The code is now identical to the current gtest branch for PR madgraph5#908.
Will instead solve madgraph5#907 by using a much simpler test infrastructure with fewer template levels.
Revert "[gtest2/june24] in gg_tt.mad runTest.cc, try a different way to duplicate the tests, it still segfaults - will revert"
This reverts commit 24e00a2.
Revert "[gtest2/june24] in gg_tt.mad runTest.cc, add debug printout for test ctor/dtor"
This reverts commit 0700a85.
Revert "[gtest2/june24] in gg_tt.mad runTest.cc, temporarely add back the duplicate test"
This reverts commit 0cce7fb.
…to remove unnecessary levels of templating from GoogleTest.
With the new implementation, successfully duplicate the madgraph comparison test in both CPU and CUDA, avoiding segfaults madgraph5#907
There are only two issues left to solve:
- First, I had to comment out some "HasFailures" which otherwise was preventing compilation
- Second, the gpuDeviceReset must be moved elsewhere, in a single place for both tests
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[ OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[ OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[ OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[ OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (180 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (180 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (395 ms total)
[ PASSED ] 4 tests.
INFO: No Floating Point Exceptions have been reported
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
…tests and add back the duplication of tests - the segfault madgraph5#907 disappears if gpuDeviceReset is removed!
git revert b6dcf4e
git cherry-pick 0cce7fb
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[ OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[ OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
INFO: No Floating Point Exceptions have been reported
[ OK ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0 (200 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest (200 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
INFO: No Floating Point Exceptions have been reported
[ OK ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0 (181 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest (181 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (398 ms total)
[ PASSED ] 4 tests.
…he segfault does reappear! THIS PROVES THAT GPUDEVICERESET CAUSES madgraph5#907
Revert "[gtest2/june24] in gg_tt.mad runTest.cc, comment out gpuDeviceReset(), this makes the previous problems disappear (but there is a leak...)"
This reverts commit 503a69d.
(The code is now identical to 0cce7fb)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[ OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[ OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
INFO: No Floating Point Exceptions have been reported
[ OK ] SIGMA_SM_GG_TTX_GPU1/MadgraphTest.CompareMomentaAndME/0 (226 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU1/MadgraphTest (226 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU2/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
Segmentation fault (core dumped)
…he simpler gtest infrastructure, must fix the leak by calling reset>
(The code is now identical to 503a69d)
Revert "[gtest2/june24] in gg_tt.mad runTest.cc, add back gpuDeviceReset(), the segfault does reappear! THIS PROVES THAT GPUDEVICERESET CAUSES madgraph5#907"
This reverts commit a914fa2.
Revert "[gtest2/june24] in gg_tt.mad, temporarely revert the simplication of tests and add back the duplication of tests - the segfault madgraph5#907 disappears if gpuDeviceReset is removed!"
This reverts commit d154507.
… to the atexit function, but this STILL crashes! madgraph5#907
WILL THEREFORE COMMENT OUT THIS CALL...
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[ OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[ OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[ OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[ OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (179 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (179 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (393 ms total)
[ PASSED ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
… to avoid all crashes madgraph5#907 (FIXME? avoid cuda api calls in dtors?)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[ OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[ OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[ OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (199 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (199 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[ OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (181 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (181 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (396 ms total)
[ PASSED ] 4 tests.
INFO: No Floating Point Exceptions have been reported
INFO: No Floating Point Exceptions have been reported
…ting::Test argument to the compareME function, to allow the use of HasFailure
This essentially COMPLETES the fixes for madgraph5#907 and preparatory work for madgraph5#896
…st.cc, testxxx.cc: simplify gtest templates, remove cudaDeviceReset to fix madgraph5#907, complete preparation of two-test infrastructure madgraph5#896
In more detail:
- move to the simplest "TEST(" use case of Google tests in MadgraphTest.h and runTest.cc (remove unnecessary levels of templating)
- move gpuDeviceReset() to an atexit function of main in testxxx and comment it out anyway, to fix the segfaults madgraph5#907 (eventually it may be necessary to remove all CUDA API calls from destructors, if ever we need to put this back in)
- in runTest.cc, complete a proof of concept for adding two separate tests (without/with multichannel madgraph5#896)
Fix some clang formatting issues with respect to the last gg_tt.mad
This is finally understood, and I have prepared a workaround in an upcoming PR.

The problem was ultimately that cudaDeviceReset() may lead to destructors being called twice. In particular, the structure of our Google tests was such that, with two tests, the first test goes out of scope and calls cudaDeviceReset while the second test still needs to execute, and this segfaults. See for instance https://stackoverflow.com/a/14610501 and https://stackoverflow.com/a/16982503.

I commented out this call as a workaround. A more complete fix may require a more significant rewrite of the CUDA classes to avoid CUDA API calls in destructors, but there is no guarantee that it would work. In addition, it does not really seem to be needed: I tested runTest.exe through the CUDA compute-sanitizer and it reports no leaks even if cudaDeviceReset is not called.

In the upcoming PR I also modified the googletest infrastructure, simplifying the use of templates, to make it easier to understand when classes are created and by whom. The functionality is fully equivalent and I find the code easier to navigate now.
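As a rough illustration of the failure mode described above, here is a minimal sketch in plain C++ that does not use CUDA at all. `FakeDevice`, `FakeBuffer`, `FakeStatus`, `Outcome` and `runTwoTests` are hypothetical names invented for this example and are not part of the project or of the CUDA API. The sketch only models two assumptions: that a device reset invalidates every outstanding allocation, and that the second test's resources already exist when the first test finishes, so that a later release from a destructor fails (loosely analogous to the 'invalid argument' error from assertGpu).

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical status codes, standing in for GPU runtime error codes.
enum class FakeStatus { Success, InvalidArgument };

// Hypothetical stand-in for a GPU runtime (this is NOT the CUDA API).
class FakeDevice {
public:
  // Hand out an opaque handle and remember it as live.
  int allocate() {
    const int handle = m_next++;
    m_live.insert( handle );
    return handle;
  }
  // Releasing a handle that the device no longer knows about is an error.
  FakeStatus release( int handle ) {
    return m_live.erase( handle ) == 1 ? FakeStatus::Success : FakeStatus::InvalidArgument;
  }
  // Model of a device reset: every outstanding allocation is dropped at once.
  void reset() { m_live.clear(); }
  bool isLive( int handle ) const { return m_live.count( handle ) != 0; }
private:
  std::set<int> m_live;
  int m_next = 1;
};

// RAII buffer whose destructor calls back into the device and logs the result.
class FakeBuffer {
public:
  FakeBuffer( FakeDevice& dev, std::vector<FakeStatus>& log )
    : m_dev( dev ), m_log( log ), m_handle( dev.allocate() ) {}
  ~FakeBuffer() { m_log.push_back( m_dev.release( m_handle ) ); }
  FakeBuffer( const FakeBuffer& ) = delete;
  FakeBuffer& operator=( const FakeBuffer& ) = delete;
  bool usable() const { return m_dev.isLive( m_handle ); }
private:
  FakeDevice& m_dev;
  std::vector<FakeStatus>& m_log;
  int m_handle;
};

struct Outcome {
  bool secondUsable;                  // can "test 2" still use its buffer?
  std::vector<FakeStatus> releaseLog; // one entry per destructor call
};

// Two "tests" own buffers on the same device; optionally the device is reset
// as soon as the first test finishes, while the second one is still pending.
Outcome runTwoTests( bool resetAfterFirst ) {
  FakeDevice dev;
  Outcome out{};
  {
    FakeBuffer second( dev, out.releaseLog ); // assumed to exist before test 1 ends
    {
      FakeBuffer first( dev, out.releaseLog );
    } // test 1 ends: its buffer is released normally
    if( resetAfterFirst ) dev.reset();
    out.secondUsable = second.usable();
  } // test 2 ends: the release fails if the device was reset in between
  return out;
}
```

In this model, `runTwoTests(true)` leaves the second buffer unusable and logs a failed release for it, whereas `runTwoTests(false)` logs two successful releases, which corresponds to the workaround of not resetting the device between tests.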
This will be fixed by merging #909. Closing.
…ion, removing the attempts to add two tests madgraph5#896 My last commit was showing the segfault issue madgraph5#907 solved in upcoming PR madgraph5#909 (and bits of madgraph5#908). I will cherry pick the CODEGEN from madgraph5#909 (and madgraph5#908) first and try again. git checkout 3eb4c29 gg_tt.mad/SubProcesses/runTest.cc
I repeat here what I had initially filed in #903. Unfortunately there were two issues
This bug report is about the latter issue, which is a blocker for the way I am implementing #896 (two tests for channelids).
I am working on adding tests for two warps with different channels (#896).
The working idea is to have two tests in runTest.cc (one with and one without channelids).
As a preliminary check of the infrastructure, I duplicated the existing no-multichannel test. This gives an unexpected segfault in a surprising place.
The code change is
The segfault is
NB after the fixes for #903 in constexpr_math.h, this only happens in the cuda code while the same code in c++ is ok.