Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874

valassi · 2024-06-28T17:47:07Z

This is a WIP PR to allow no-OpenMP builds (#758) and to debug intermittent FPEs (#845).

This is starting from MR #873 which fixes color mappings and related issues.

The reason why OMP and FPEs are related is that I had some indication that the FPE crash is intermittent with OMP, but happens all the time without OMP. So disabling OMP might trigger the issue more often.
See f1e0d42

…f OMPFLAGS is set externally to an empty string (madgraph5#758)

…setting OMPFLAGS= as implemented in madgraph5#758)

… of known issues: the CI will fail

…externally to an empty string (madgraph5#758)

…ff OpenMP, to debug madgraph5#845 As previously observed, this crashes immediately (NB: it only crashes with AVX512 in '512z' mode!) gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \ -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit Program received signal SIGFPE, Arithmetic exception. 0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197 1197 if( okcol ) +where 0 0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197 1 0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1) at MatrixElementKernels.cc:115 2 0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cc70, selhel=0x7fffffc2cc70, mes=0x7fffffc3cc70, channelId=1, rndcol=0x7fffffc9cfd0, rndhel=0x7fffffcbcfd0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173 3 fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbcfd0, rndcol=0x7fffffc9cfd0, pchannelId=<optimized out>, mes=0x7fffffc3cc70, selhel=0x7fffffc2cc70, selcol=0x7fffffc1cc70) at fbridge.cc:106 4 0x00000000004300ec in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., vecsize_used=16384) at auto_dsig1.f:618 5 0x0000000000431c71 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445 6 0x0000000000432da8 in dsigproc_vec (all_p=..., all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034 7 0x0000000000433b7f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, all_out=..., vecsize_used=16384) at auto_dsig.f:327 8 0x000000000044a9c2 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d70 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384) at dsample.f:208 9 0x000000000042ebe0 in driver () at driver.f:257 10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302 11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6 12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6 13 0x0000000000403845 in _start () +l 1192 #if defined MGONGPU_CPPSIMD 1193 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] ); 1194 #else 1195 const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] ); 1196 #endif 1197 if( okcol ) 1198 { 1199 allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1] 1200 break; 1201 } +p okcol $1 = <optimized out>

valassi · 2024-06-28T19:30:17Z

Note: even after disabling OMP in the CI tests, issue #845 has not shown up. The only 6 usual errors are from #826 and #872.

I think that I understand this better now: this issue ONLY happens for AVX512 (and specifically 512z with FPTYPE=f). Our CI on standard github nodes does not offer AVX512 (and it would be an overkill to set up our own nodes just for that). So I will debug this in manual tests.

…adgraph5#845) by adding 'volatile', again This no longer crashes gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \ -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit

… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again

…FPE crashes madgraph5#845 in 512z/f (and the option to hardcode away OMP)

…g of known issues: the CI will succeed Revert "[fpe] in .github/workflows/testsuite_oneprocess.sh, disable bypassing of known issues: the CI will fail" This reverts commit a5b7c42.

valassi · 2024-06-28T19:57:36Z

Hi @oliviermattelaer yet another one (we are getting there...).

I have fixed this intermittent crash #845 again by using 'volatile'. This time it should be less controversial I hope. The crash happens deep in SIMD code and only in specific configurations like 512z and fptype=f. So I am quite confident that disabling optimizations with volatile makes sense. Note this interesting post on SIMD and float, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993

En passant I added the option (that you had asked for in #758) to disable OpenMP. You just need to export OMPFLAGS= (set an empty string: this is different from unsetting it).

Can you please review?
NB This is based on #873, so I would FIRST review and merge that, and THEN review this (so it becomes easier to see only the additional differences).

Thanks!
Andrea

…raph5#873) into fpe

STARTED AT Sat Jun 29 01:07:28 PM CEST 2024 ./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean ENDED(1) AT Sat Jun 29 02:40:30 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean ENDED(2) AT Sat Jun 29 02:55:58 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean ENDED(3) AT Sat Jun 29 03:03:51 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst ENDED(4) AT Sat Jun 29 03:06:32 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst ENDED(5) AT Sat Jun 29 03:09:12 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common ENDED(6) AT Sat Jun 29 03:11:56 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean ENDED(7) AT Sat Jun 29 03:23:34 PM CEST 2024 [Status=0]

…ected (failures in heft madgraph5#833 and susy madgraph5#826 - but intermittent gqttq madgraph5#845 is fixed) Note two points: - gqttq madgraph5#845 was intermittent, so the fact that it has disappeared could be casual: but I actually think it is fixed - the tmad CI also shows pptt012j madgraph5#872, but I am not running pptt012j tests in the tmad suite yet STARTED AT Sat Jun 29 03:23:34 PM CEST 2024 (SM tests) ENDED(1) AT Sat Jun 29 07:44:46 PM CEST 2024 [Status=0] (BSM tests) ENDED(1) AT Sat Jun 29 07:54:26 PM CEST 2024 [Status=0] 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

valassi · 2024-07-01T07:33:53Z

I have rerun the manual tests. This is now ready to be merged (but it would be best to first agree on and merge mg5amcnlo/mg5amcnlo#115 and #873)

valassi · 2024-07-01T14:18:56Z

Putting this back to draft for the moment until we agree on #873 and #877

… merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)

…er merging git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)

…adgraph5#852 and color2 PR madgraph5#877, instead of color PR madgraph5#873) into fpe

valassi · 2024-07-03T13:36:05Z

I have merged the latest upstream/master (including PR #852 and PR #877) into this and marked it again as ready for review.

…easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)

… and madgraph5#881 fixes for madgraph5#856) into fpe

… to be sure - no change in code (two matrix1.pdf changed but content looks the same?)

valassi · 2024-07-04T13:52:09Z

Hi @oliviermattelaer I have updated this again with upstream/master, it is ready to be merged

it fixes an intermittent segfault (Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (for cpp512z with FPTYPE=f only: fix it with 'volatile') #845), again with a 'volatile', but this is very clearly a SIMD AVX512 issue
and it allows no-OpenMP builds by setting 'export OMPFLAGS=', which is more or less what you had requested (Why do we need to compile with openmp flag? #758)

Can you please review? (when you have time, not urgent)

Thanks
Andrea

oliviermattelaer · 2024-07-04T14:57:36Z

Perfect to be merged (but I would like to keep #758 open, since I think that we should have an agreement on how we handle OpenMP by default from the madgraph5. (i.e. should I set OMPFLAGS= by default or not, should we have an entry in the run_card to control that /...) -> will copy-paste this message in #758 (since this is the proper place to discuss that).

Thanks a lot,

Olivier

valassi · 2024-07-04T15:01:29Z

Very good, thanks Olivier!

I will merge this then, and reopen #758

Note, there are 6 errors in the CI but they are the usual failing ones

…from fpe PR madgraph5#874 and omp PR madgraph5#900 for OpenMP issue madgraph5#758 - bldall and runTest succeds

valassi added 3 commits June 28, 2024 19:31

[fpe] in CODEGEN cudacpp.mk, add the option to build without OpenMP i…

cb67911

…f OMPFLAGS is set externally to an empty string (madgraph5#758)

[fpe] regenerate all processes, with the option to disable OpenMP

d4ffbbe

[fpe] in testsuite_oneprocess.sh, disable OpenMP in the CI tests (by …

3758760

…setting OMPFLAGS= as implemented in madgraph5#758)

valassi requested a review from a team as a code owner June 28, 2024 17:47

valassi marked this pull request as draft June 28, 2024 17:47

valassi added 3 commits June 28, 2024 20:37

[fpe] in .github/workflows/testsuite_oneprocess.sh, disable bypassing…

a5b7c42

… of known issues: the CI will fail

[fpe] in tput and tmad, disable OpenMP in manual tests: set OMPFLAGS …

afa50dc

…externally to an empty string (madgraph5#758)

valassi added 5 commits June 28, 2024 21:47

[fpe] in CODEGEN (backport from gq_ttq.mad), fix the SIGFPE crash for…

8d700fd

… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again

[fpe] in CODEGEN cudacpp.mk, add the option to hardcode away OMP

f805e8d

[fpe] regenerate all processes - including the 'volatile' fix for SIG…

e210201

…FPE crashes madgraph5#845 in 512z/f (and the option to hardcode away OMP)

[fpe] in .github/workflows/testsuite_oneprocess.sh, reenable bypassin…

d67ab23

…g of known issues: the CI will succeed Revert "[fpe] in .github/workflows/testsuite_oneprocess.sh, disable bypassing of known issues: the CI will fail" This reverts commit a5b7c42.

valassi mentioned this pull request Jun 28, 2024

Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (for cpp512z with FPTYPE=f only: fix it with 'volatile') #845

Closed

This was linked to issues Jun 28, 2024

Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (for cpp512z with FPTYPE=f only: fix it with 'volatile') #845

Closed

Why do we need to compile with openmp flag? #758

Closed

valassi changed the title ~~WIP: allow no-OpenMP builds and debug intermittent FPEs~~ Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds and Jun 28, 2024

valassi marked this pull request as ready for review June 28, 2024 19:53

valassi requested a review from oliviermattelaer June 28, 2024 19:53

valassi self-assigned this Jun 28, 2024

valassi mentioned this pull request Jun 28, 2024

No cross section in SUSY gg_t1t1 log file #826

Closed

valassi added 4 commits June 29, 2024 13:00

Merge branch 'color' (including the latest tput/tmad logs for PR madg…

f0598a1

…raph5#873) into fpe

[fpe] regenerate all processes again, check all is ok

005b228

valassi marked this pull request as draft July 1, 2024 14:16

valassi added 4 commits July 3, 2024 15:25

[fpe] move to CODEGEN logs from the latest upstream/master for easier…

b0d1bff

… merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)

[fpe] move to tput/tmad logs from the latest upstream/master for easi…

72b7dc0

…er merging git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)

Merge remote-tracking branch 'upstream/master' (including fix_826 PR m…

7f763b4

…adgraph5#852 and color2 PR madgraph5#877, instead of color PR madgraph5#873) into fpe

[fpe] regenerate all processes - check that all is ok

5e528f9

valassi changed the title ~~Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds and~~ Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds Jul 3, 2024

valassi marked this pull request as ready for review July 3, 2024 13:35

valassi added 3 commits July 4, 2024 15:37

[fpe] move again to CODEGEN logs from the latest upstream/master for …

f04fe6a

…easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)

Merge remote-tracking branch 'upstream/master' (including madgraph5#880…

8199fdf

… and madgraph5#881 fixes for madgraph5#856) into fpe

[fpe] ** COMPLETE FPE (again) ** regenerate all processes again, just…

c88d62a

… to be sure - no change in code (two matrix1.pdf changed but content looks the same?)

valassi merged commit c5805fe into madgraph5:master Jul 4, 2024
163 of 169 checks passed

valassi mentioned this pull request Jul 4, 2024

Why do we need to compile with openmp flag? #758

Closed

valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 17, 2024

[june24] regenerate gg_tt.mad after cherry-picking the relevant bits …

09d203c

…from fpe PR madgraph5#874 and omp PR madgraph5#900 for OpenMP issue madgraph5#758 - bldall and runTest succeds

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874

valassi commented Jun 28, 2024

valassi commented Jun 28, 2024

valassi commented Jun 28, 2024

valassi commented Jul 1, 2024

valassi commented Jul 1, 2024

valassi commented Jul 3, 2024

valassi commented Jul 4, 2024 •

edited

Loading

oliviermattelaer commented Jul 4, 2024

valassi commented Jul 4, 2024

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874

Conversation

valassi commented Jun 28, 2024

valassi commented Jun 28, 2024

valassi commented Jun 28, 2024

valassi commented Jul 1, 2024

valassi commented Jul 1, 2024

valassi commented Jul 3, 2024

valassi commented Jul 4, 2024 • edited Loading

oliviermattelaer commented Jul 4, 2024

valassi commented Jul 4, 2024

valassi commented Jul 4, 2024 •

edited

Loading