valgrind issues #868

Closed
valassi opened this issue Jun 25, 2024 · 7 comments · Fixed by #869
@valassi
Member

valassi commented Jun 25, 2024

Since we are not yet converging on some issues like the rotxxx segfault #855 and the possible fix with volatile in #857, in parallel I am also running some checks with valgrind.

On the current upstream/master 286280f, using the same setup as for the #855 reproducer described in #857 (comment), I first run valgrind on the Fortran executable ONLY.

cd madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
make cleanall
make -f cudacpp.mk gtestlibs
make -j BACKEND=cppnone -f cudacpp.mk debug
make -j BACKEND=cppnone
cat > input_cudacpp_104 << EOF
32 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
104 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
gdb -ex 'run < input_cudacpp_104' -ex where -ex 'set confirm off' -ex quit ./madevent_cpp

Note that 32 events are enough: madevent_cpp still crashes.

I run valgrind as follows

valgrind --leak-check=full --gen-suppressions=all --log-file=memcheck.log ./madevent_fortran < input_cudacpp_104

This reports

more memcheck.log 
==3678768== Memcheck, a memory error detector
==3678768== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3678768== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3678768== Command: ./madevent_fortran
==3678768== Parent PID: 3638452
==3678768== 
==3678768== Conditional jump or move depends on uninitialised value(s)
==3678768==    at 0x425E73: setclscales_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x4284D9: update_scale_coupling_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x436B87: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x45AFAA: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x4331AD: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x40268E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:setclscales_
   fun:update_scale_coupling_vec_
   fun:dsig_vec_
   fun:sample_full_
   fun:MAIN__
   fun:main
}
==3678768== 
==3678768== HEAP SUMMARY:
==3678768==     in use at exit: 552 bytes in 3 blocks
==3678768==   total heap usage: 137,537 allocs, 137,534 frees, 48,779,094 bytes allocated
==3678768== 
==3678768== 544 (32 direct, 512 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==3678768==    at 0x719786F: malloc (vg_replace_malloc.c:381)
==3678768==    by 0x7404CC8: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x7647C63: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x7635319: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x76357FC: _gfortran_st_open (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x47AA9F: open_file_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x432C91: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x40268E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   fun:_gfortran_st_open
   fun:open_file_
   fun:MAIN__
   fun:main
}
==3678768== LEAK SUMMARY:
==3678768==    definitely lost: 32 bytes in 1 blocks
==3678768==    indirectly lost: 512 bytes in 1 blocks
==3678768==      possibly lost: 0 bytes in 0 blocks
==3678768==    still reachable: 8 bytes in 1 blocks
==3678768==         suppressed: 0 bytes in 0 blocks
==3678768== Reachable blocks (those to which a pointer was found) are not shown.
==3678768== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3678768== 
==3678768== Use --track-origins=yes to see where uninitialised values come from
==3678768== For lists of detected and suppressed errors, rerun with: -s
==3678768== ERROR SUMMARY: 26058 errors from 2 contexts (suppressed: 0 from 0)
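
To compare runs like the one above without reading the whole log, the summary lines can be pulled out with a small script. This is only a minimal sketch: the helper name `parse_memcheck_summary` is hypothetical, and the regular expressions assume the line layout shown in this log (Valgrind 3.19.0).

```python
import re

def parse_memcheck_summary(log_text):
    """Extract error and leak counts from Memcheck log text (hypothetical helper)."""
    summary = {}
    # e.g. "==123== ERROR SUMMARY: 26058 errors from 2 contexts (suppressed: 0 from 0)"
    m = re.search(r"ERROR SUMMARY: ([\d,]+) errors from ([\d,]+) contexts", log_text)
    if m:
        summary["errors"] = int(m.group(1).replace(",", ""))
        summary["contexts"] = int(m.group(2).replace(",", ""))
    # e.g. "==123==    definitely lost: 32 bytes in 1 blocks"
    for kind in ("definitely lost", "indirectly lost", "possibly lost", "still reachable"):
        m = re.search(kind + r": ([\d,]+) bytes in ([\d,]+) blocks", log_text)
        if m:
            summary[kind] = int(m.group(1).replace(",", ""))  # bytes only
    return summary

# Lines copied from the log above
sample = """\
==3678768== LEAK SUMMARY:
==3678768==    definitely lost: 32 bytes in 1 blocks
==3678768==    indirectly lost: 512 bytes in 1 blocks
==3678768==      possibly lost: 0 bytes in 0 blocks
==3678768==    still reachable: 8 bytes in 1 blocks
==3678768== ERROR SUMMARY: 26058 errors from 2 contexts (suppressed: 0 from 0)
"""
print(parse_memcheck_summary(sample))
```

In practice the text would come from `open("memcheck.log").read()` instead of the inline sample.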
@valassi
Member Author

valassi commented Jun 25, 2024

Adding -g to GLOBAL_FLAGS gives slightly more detail (source files and line numbers)

==3682257== Memcheck, a memory error detector
==3682257== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3682257== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3682257== Command: ./madevent_fortran
==3682257== Parent PID: 3638452
==3682257== 
==3682257== Conditional jump or move depends on uninitialised value(s)
==3682257==    at 0x425E73: setclscales_ (reweight.f:1230)
==3682257==    by 0x4284D9: update_scale_coupling_vec_ (reweight.f:1876)
==3682257==    by 0x436B87: dsig_vec_ (auto_dsig.f:316)
==3682257==    by 0x45AFAA: sample_full_ (dsample.f:208)
==3682257==    by 0x4331AD: MAIN__ (driver.f:256)
==3682257==    by 0x40268E: main (driver.f:301)
==3682257== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:setclscales_
   fun:update_scale_coupling_vec_
   fun:dsig_vec_
   fun:sample_full_
   fun:MAIN__
   fun:main
}
==3682257== 
==3682257== HEAP SUMMARY:
==3682257==     in use at exit: 552 bytes in 3 blocks
==3682257==   total heap usage: 137,537 allocs, 137,534 frees, 48,779,094 bytes allocated
==3682257== 
==3682257== 544 (32 direct, 512 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==3682257==    at 0x719786F: malloc (vg_replace_malloc.c:381)
==3682257==    by 0x7404CC8: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x7647C63: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x7635319: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x76357FC: _gfortran_st_open (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x47AA9F: open_file_ (open_file.f:40)
==3682257==    by 0x432C91: MAIN__ (driver.f:151)
==3682257==    by 0x40268E: main (driver.f:301)
==3682257== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   fun:_gfortran_st_open
   fun:open_file_
   fun:MAIN__
   fun:main
}
==3682257== LEAK SUMMARY:
==3682257==    definitely lost: 32 bytes in 1 blocks
==3682257==    indirectly lost: 512 bytes in 1 blocks
==3682257==      possibly lost: 0 bytes in 0 blocks
==3682257==    still reachable: 8 bytes in 1 blocks
==3682257==         suppressed: 0 bytes in 0 blocks
==3682257== Reachable blocks (those to which a pointer was found) are not shown.
==3682257== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3682257== 
==3682257== Use --track-origins=yes to see where uninitialised values come from
==3682257== For lists of detected and suppressed errors, rerun with: -s
==3682257== ERROR SUMMARY: 26058 errors from 2 contexts (suppressed: 0 from 0)
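
With `-g` the frames carry `file:line` information, which makes it easy to list the source locations involved. The following is a rough sketch under the assumption that frames look like the Fortran ones above (a function name without spaces, followed by `(file:line)`); the name `parse_frames` is hypothetical, and C++ frames with spaces in their signatures would need a looser pattern.

```python
import re

# Matches e.g. "==3682257==    at 0x425E73: setclscales_ (reweight.f:1230)"
# Frames without debug info, e.g. "??? (in /usr/lib64/libgfortran.so.5.0.0)", do not match.
FRAME_RE = re.compile(
    r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (\S+) \(([^():]+):(\d+)\)\s*$"
)

def parse_frames(log_text):
    """Return (function, file, line) for every frame that has debug info."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group(1), m.group(2), int(m.group(3))))
    return frames

# Lines copied from the log above
sample_frames = """\
==3682257==    at 0x425E73: setclscales_ (reweight.f:1230)
==3682257==    by 0x4284D9: update_scale_coupling_vec_ (reweight.f:1876)
==3682257==    by 0x7404CC8: ??? (in /usr/lib64/libgfortran.so.5.0.0)
"""
for function, filename, lineno in parse_frames(sample_frames):
    print(f"{filename}:{lineno} {function}")
```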

@valassi
Member Author

valassi commented Jun 25, 2024

The leak is reported in mg5amcnlo/mg5amcnlo#109
A fix for the leak is in mg5amcnlo/mg5amcnlo#110

@valassi valassi self-assigned this Jun 25, 2024
@valassi
Member Author

valassi commented Jun 25, 2024

(Note: this is one example of #207 about testing code through valgrind)

@valassi
Member Author

valassi commented Jun 25, 2024

The uninitialised value is reported in mg5amcnlo/mg5amcnlo#111
A workaround (not a real fix) is in mg5amcnlo/mg5amcnlo#112

After applying both patches above, there are now no valgrind issues in madevent_fortran

==3735418== Memcheck, a memory error detector
==3735418== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3735418== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3735418== Command: ./madevent_fortran
==3735418== Parent PID: 3638452
==3735418== 
==3735418== 
==3735418== HEAP SUMMARY:
==3735418==     in use at exit: 8 bytes in 1 blocks
==3735418==   total heap usage: 141,695 allocs, 141,694 frees, 50,553,628 bytes allocated
==3735418== 
==3735418== LEAK SUMMARY:
==3735418==    definitely lost: 0 bytes in 0 blocks
==3735418==    indirectly lost: 0 bytes in 0 blocks
==3735418==      possibly lost: 0 bytes in 0 blocks
==3735418==    still reachable: 8 bytes in 1 blocks
==3735418==         suppressed: 0 bytes in 0 blocks
==3735418== Rerun with --leak-check=full to see details of leaked memory
==3735418== 
==3735418== For lists of detected and suppressed errors, rerun with: -s
==3735418== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
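
As a quick consistency check on the heap summaries before and after the patches, the difference between allocs and frees can be compared with the "in use at exit" block count. In the logs here it is 3 blocks before and 1 block after. This is just an illustrative sketch; `heap_balance` is a hypothetical helper name.

```python
import re

def heap_balance(log_text):
    """Return allocs minus frees from a 'total heap usage' line, or None if absent."""
    m = re.search(r"total heap usage: ([\d,]+) allocs, ([\d,]+) frees", log_text)
    if m is None:
        return None
    allocs = int(m.group(1).replace(",", ""))
    frees = int(m.group(2).replace(",", ""))
    return allocs - frees

# Lines copied from the logs in this thread
before = "==3682257==   total heap usage: 137,537 allocs, 137,534 frees, 48,779,094 bytes allocated"
after = "==3735418==   total heap usage: 141,695 allocs, 141,694 frees, 50,553,628 bytes allocated"

print(heap_balance(before))  # 3, matching "in use at exit: 552 bytes in 3 blocks"
print(heap_balance(after))   # 1, matching "in use at exit: 8 bytes in 1 blocks"
```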

Note that the rotxxx crash, however, is still there in madevent_cpp

gdb -ex 'run < input_cudacpp_104' -ex where -ex 'set confirm off' -ex quit ./madevent_cpp
...
Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247              prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
#0  rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
#1  0x00000000004087e0 in gentcms (pa=..., pb=..., t=-181765.47706865534, phi=0.64468537567405615, ma2=0, m1=234.1712866912786, 
    m2=210.15563843880372, p1=..., pr=..., jac=3.0327734872026782e+25) at genps.f:1480
#2  0x0000000000409849 in one_tree (itree=..., tstrategy=<optimized out>, iconfig=104, nbranch=4, p=..., m=..., s=..., x=..., 
    jac=3.0327734872026782e+25, pswgt=1) at genps.f:1167
#3  0x000000000040bb84 in gen_mom (iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p1=...) at genps.f:68
#4  0x000000000040d1aa in x_to_f_arg (ndim=10, iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p=...)
    at genps.f:60
#5  0x000000000045c865 in sample_full (ndim=10, ncall=32, itmax=1, itmin=1, dsig=0x438b00 <dsig>, ninvar=10, nconfigs=1, 
    vecsize_used=16384) at dsample.f:172
#6  0x000000000043427a in driver () at driver.f:257
#7  0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
#8  0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000000000403845 in _start ()

I will now check valgrind also on madevent_cpp

@valassi
Member Author

valassi commented Jun 25, 2024

I have tried to run valgrind on madevent_cpp but this hangs...

valgrind --track-origins=yes --gen-suppressions=all --max-stackframe=3932984 --log-file=memcheckc2.log ./madevent_cpp < input_cudacpp_104

After 200 minutes (more than 3 hours) this was still running...

==3737758== Process terminating with default action of signal 2 (SIGINT)
==3737758==    at 0x6E692E5: KernelAccessHelper<mg5amcCpu::MemoryAccessCouplingsBase, false>::kernelAccessRecord(double*) (MemoryAccessHelpers.h:116)
==3737758==    by 0x6E59E11: double& KernelAccessHelper<mg5amcCpu::MemoryAccessCouplingsBase, false>::kernelAccessField<int>(double*, int) (MemoryAccessHelpers.h:139)
==3737758==    by 0x6E6A02D: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessIx2(double*, int) (MemoryAccessCouplings.h:185)
==3737758==    by 0x6E69D99: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessIx2Const(double const*, int) (MemoryAccessCouplings.h:205)
==3737758==    by 0x6E694CB: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessConst(double const*) (MemoryAccessCouplings.h:256)
==3737758==    by 0x6E662B9: void mg5amcCpu::FFV1_0<mg5amcCpu::KernelAccessWavefunctions<false>, mg5amcCpu::KernelAccessAmplitudes<false>, mg5amcCpu::KernelAccessCouplings<false> >(double const*, double const*, double const*, double const*, double, double*) (HelAmps_sm.h:1104)
==3737758==    by 0x6E485A4: mg5amcCpu::calculate_wavefunctions(int, double const*, double const*, double*, unsigned int, double*, double*, double*, int) (CPPProcess.cc:1174)
==3737758==    by 0x6E5914F: mg5amcCpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*, int) [clone ._omp_fn.0] (CPPProcess.cc:3223)
==3737758==    by 0x748F575: GOMP_parallel (in /usr/lib64/libgomp.so.1.0.0)
==3737758==    by 0x6E58BE2: mg5amcCpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*, int) (CPPProcess.cc:3203)
==3737758==    by 0x6E6B371: mg5amcCpu::MatrixElementKernelHost::computeMatrixElements(unsigned int) (MatrixElementKernels.cc:115)
==3737758==    by 0x6E6DAFB: mg5amcCpu::Bridge<double>::cpu_sequence(double const*, double const*, double const*, double const*, unsigned int, double*, int*, int*, bool) (Bridge.h:390)

I will abandon this line of tests for the moment
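
For a run that may hang like this one, a wrapper with a time limit avoids having to interrupt it by hand. What follows is only a sketch under assumptions: the helper names `build_valgrind_cmd` and `run_with_timeout` are hypothetical, the valgrind options are the ones used in the command above, and the time limit is an arbitrary choice.

```python
import subprocess

def build_valgrind_cmd(executable, log_file, max_stackframe=None):
    """Assemble the valgrind command line used above (hypothetical helper)."""
    cmd = ["valgrind", "--track-origins=yes", "--gen-suppressions=all"]
    if max_stackframe is not None:
        cmd.append(f"--max-stackframe={max_stackframe}")
    cmd.append(f"--log-file={log_file}")
    cmd.append(executable)
    return cmd

def run_with_timeout(cmd, input_path, timeout_s):
    """Run cmd with stdin redirected from input_path.

    Returns the exit code, or None if the time limit was reached
    (subprocess.run kills the child process in that case).
    """
    with open(input_path, "rb") as stdin:
        try:
            proc = subprocess.run(cmd, stdin=stdin, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return None
    return proc.returncode

cmd = build_valgrind_cmd("./madevent_cpp", "memcheckc2.log", max_stackframe=3932984)
print(" ".join(cmd))
# Example call (not executed here): give up after one hour
# status = run_with_timeout(cmd, "input_cudacpp_104", timeout_s=3600)
```

A `None` return value then indicates the time limit was hit, while the partial log file remains available for inspection.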

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…grind leak warning madgraph5#868

==3682257== 544 (32 direct, 512 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==3682257==    at 0x719786F: malloc (vg_replace_malloc.c:381)
==3682257==    by 0x7404CC8: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x7647C63: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x7635319: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x76357FC: _gfortran_st_open (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x47AA9F: open_file_ (open_file.f:40)
==3682257==    by 0x432C91: MAIN__ (driver.f:151)
==3682257==    by 0x40268E: main (driver.f:301)
==3682257==
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…clscales (workaround for valgrind undefined behaviour madgraph5#868)

==3687827== Conditional jump or move depends on uninitialised value(s)
==3687827==    at 0x425E73: setclscales_ (reweight.f:1230)
==3687827==    by 0x4284D9: update_scale_coupling_vec_ (reweight.f:1876)
==3687827==    by 0x436BC7: dsig_vec_ (auto_dsig.f:316)
==3687827==    by 0x45AFEA: sample_full_ (dsample.f:208)
==3687827==    by 0x4331AD: MAIN__ (driver.f:257)
==3687827==    by 0x40268E: main (driver.f:302)
==3687827==  Uninitialised value was created by a stack allocation
==3687827==    at 0x424051: setclscales_ (reweight.f:555)
@valassi
Member Author

valassi commented Jun 27, 2024

A couple of valgrind issues (found running valgrind over Fortran) have been addressed.
The fixes are merged in PR #869.

The remaining issue is that running valgrind over the cudacpp madevent hangs. This should eventually be tracked in a new issue.

I am closing this for simplicity, as the most urgent issues are gone.

@valassi valassi closed this as completed Jun 27, 2024
@valassi
Member Author

valassi commented Jun 27, 2024

PS A couple of comments from extra tests

  • I do sometimes see a lot of invalid reads and writes at addresses created by dsig1_vec, but this happens ONLY if I also get "client is switching stacks" warnings. If I add "--max-stackframe", these errors disappear. So I tend to think that they are spurious errors.
  • If I try to run valgrind on an AVX512 build, this fails (I get my message that AVX512 is not supported, which probably describes the virtual environment inside valgrind). This is well known: https://bugs.kde.org/show_bug.cgi?id=383010
  • I HAVE managed to run a madevent_cpp through valgrind and get it to complete: this was WITHOUT "-g" (very strangely, the "-g" builds eventually lead to a hang??). Anyway: the important thing is that I saw NO ERROR FROM VALGRIND ON MADEVENT_CPP. (This is after fixing rotxxx with volatile, and also using max stackframe; and using a cppnone no-SIMD build).

So I would say that I have completed this valgrind investigation for now...
