Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874

Merged
merged 22 commits into from
Jul 4, 2024

Conversation

valassi
Copy link
Member

@valassi valassi commented Jun 28, 2024

This is a WIP PR to allow no-OpenMP builds (#758) and to debug intermittent FPEs (#845).

This is starting from MR #873 which fixes color mappings and related issues.

The reason why OMP and FPEs are related is that I had some indication that the FPE crash is intermittent with OMP, but happens all the time without OMP. So disabling OMP might trigger the issue more often.
See f1e0d42

@valassi valassi requested a review from a team as a code owner June 28, 2024 17:47
@valassi valassi marked this pull request as draft June 28, 2024 17:47
…ff OpenMP, to debug madgraph5#845

As previously observed, this crashes immediately (NB: it only crashes with AVX512 in '512z' mode!)

gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \
  -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
1197                if( okcol )
+where
 0  0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
 1  0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1) at MatrixElementKernels.cc:115
 2  0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cc70, selhel=0x7fffffc2cc70, mes=0x7fffffc3cc70, channelId=1, rndcol=0x7fffffc9cfd0, rndhel=0x7fffffcbcfd0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173
 3  fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbcfd0, rndcol=0x7fffffc9cfd0, pchannelId=<optimized out>, mes=0x7fffffc3cc70, selhel=0x7fffffc2cc70, selcol=0x7fffffc1cc70) at fbridge.cc:106
 4  0x00000000004300ec in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., vecsize_used=16384) at auto_dsig1.f:618
 5  0x0000000000431c71 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445
 6  0x0000000000432da8 in dsigproc_vec (all_p=..., all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034
 7  0x0000000000433b7f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, all_out=..., vecsize_used=16384) at auto_dsig.f:327
 8  0x000000000044a9c2 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d70 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384) at dsample.f:208
 9  0x000000000042ebe0 in driver () at driver.f:257
 10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
 11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
 12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
 13 0x0000000000403845 in _start ()
+l
1192    #if defined MGONGPU_CPPSIMD
1193                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1194    #else
1195                const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1196    #endif
1197                if( okcol )
1198                {
1199                  allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
1200                  break;
1201                }
+p okcol
$1 = <optimized out>
@valassi
Copy link
Member Author

valassi commented Jun 28, 2024

Note: even after disabling OMP in the CI tests, issue #845 has not shown up. The only 6 usual errors are from #826 and #872.

I think that I understand this better now: this issue ONLY happens for AVX512 (and specifically 512z with FPTYPE=f). Our CI on standard github nodes does not offer AVX512 (and it would be an overkill to set up our own nodes just for that). So I will debug this in manual tests.

…adgraph5#845) by adding 'volatile', again

This no longer crashes

gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \
  -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit
… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again
…FPE crashes madgraph5#845 in 512z/f (and the option to hardcode away OMP)
…g of known issues: the CI will succeed

Revert "[fpe] in .github/workflows/testsuite_oneprocess.sh, disable bypassing of known issues: the CI will fail"
This reverts commit a5b7c42.
@valassi valassi changed the title WIP: allow no-OpenMP builds and debug intermittent FPEs Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds and Jun 28, 2024
@valassi valassi marked this pull request as ready for review June 28, 2024 19:53
@valassi
Copy link
Member Author

valassi commented Jun 28, 2024

Hi @oliviermattelaer yet another one (we are getting there...).

I have fixed this intermittent crash #845 again by using 'volatile'. This time it should be less controversial I hope. The crash happens deep in SIMD code and only in specific configurations like 512z and fptype=f. So I am quite confident that disabling optimizations with volatile makes sense. Note this interesting post on SIMD and float, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993

En passant I added the option (that you had asked for in #758) to disable OpenMP. You just need to export OMPFLAGS= (set an empty string: this is different from unsetting it).

Can you please review?
NB This is based on #873, so I would FIRST review and merge that, and THEN review this (so it becomes easier to see only the additional differences).

Thanks!
Andrea

STARTED  AT Sat Jun 29 01:07:28 PM CEST 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sat Jun 29 02:40:30 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sat Jun 29 02:55:58 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Sat Jun 29 03:03:51 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Sat Jun 29 03:06:32 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Sat Jun 29 03:09:12 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Sat Jun 29 03:11:56 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Sat Jun 29 03:23:34 PM CEST 2024 [Status=0]
…ected (failures in heft madgraph5#833 and susy madgraph5#826 - but intermittent gqttq madgraph5#845 is fixed)

Note two points:
- gqttq madgraph5#845 was intermittent, so the fact that it has disappeared could be casual: but I actually think it is fixed
- the tmad CI also shows pptt012j madgraph5#872, but I am not running pptt012j tests in the tmad suite yet

STARTED  AT Sat Jun 29 03:23:34 PM CEST 2024
(SM tests)
ENDED(1) AT Sat Jun 29 07:44:46 PM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Sat Jun 29 07:54:26 PM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
@valassi
Copy link
Member Author

valassi commented Jul 1, 2024

I have rerun the manual tests. This is now ready to be merged (but it would be best to first agree on and merge mg5amcnlo/mg5amcnlo#115 and #873)

@valassi valassi marked this pull request as draft July 1, 2024 14:16
@valassi
Copy link
Member Author

valassi commented Jul 1, 2024

Putting this back to draft for the moment until we agree on #873 and #877

… merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…er merging

git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)
@valassi valassi changed the title Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds and Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds Jul 3, 2024
@valassi valassi marked this pull request as ready for review July 3, 2024 13:35
@valassi
Copy link
Member Author

valassi commented Jul 3, 2024

I have merged the latest upstream/master (including PR #852 and PR #877) into this and marked it again as ready for review.

…easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
… to be sure - no change in code (two matrix1.pdf changed but content looks the same?)
@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

Hi @oliviermattelaer I have updated this again with upstream/master, it is ready to be merged

Can you please review? (when you have time, not urgent)

Thanks
Andrea

@oliviermattelaer
Copy link
Member

Perfect to be merged (but I would like to keep #758 open, since I think that we should have an agreement on how we handle OpenMP by default from the madgraph5. (i.e. should I set OMPFLAGS= by default or not, should we have an entry in the run_card to control that /...) -> will copy-paste this message in #758 (since this is the proper place to discuss that).

Thanks a lot,

Olivier

@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

Very good, thanks Olivier!

I will merge this then, and reopen #758

Note, there are 6 errors in the CI but they are the usual failing ones
image

@valassi valassi merged commit c5805fe into madgraph5:master Jul 4, 2024
163 of 169 checks passed
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 17, 2024
…from fpe PR madgraph5#874 and omp PR madgraph5#900 for OpenMP issue madgraph5#758 - bldall and runTest succeds
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants