-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874
Conversation
…f OMPFLAGS is set externally to an empty string (madgraph5#758)
…setting OMPFLAGS= as implemented in madgraph5#758)
… of known issues: the CI will fail
…externally to an empty string (madgraph5#758)
…ff OpenMP, to debug madgraph5#845 As previously observed, this crashes immediately (NB: it only crashes with AVX512 in '512z' mode!) gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \ -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit Program received signal SIGFPE, Arithmetic exception. 0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197 1197 if( okcol ) +where 0 0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197 1 0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1) at MatrixElementKernels.cc:115 2 0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cc70, selhel=0x7fffffc2cc70, mes=0x7fffffc3cc70, channelId=1, rndcol=0x7fffffc9cfd0, rndhel=0x7fffffcbcfd0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173 3 fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbcfd0, rndcol=0x7fffffc9cfd0, pchannelId=<optimized out>, mes=0x7fffffc3cc70, selhel=0x7fffffc2cc70, selcol=0x7fffffc1cc70) at fbridge.cc:106 4 0x00000000004300ec in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., vecsize_used=16384) at auto_dsig1.f:618 5 0x0000000000431c71 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445 6 0x0000000000432da8 in dsigproc_vec (all_p=..., all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034 7 0x0000000000433b7f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, all_out=..., vecsize_used=16384) at auto_dsig.f:327 8 0x000000000044a9c2 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d70 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384) at dsample.f:208 9 0x000000000042ebe0 in driver () at driver.f:257 10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302 11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6 12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6 13 0x0000000000403845 in _start () +l 1192 #if defined MGONGPU_CPPSIMD 1193 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] ); 1194 #else 1195 const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] ); 1196 #endif 1197 if( okcol ) 1198 { 1199 allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1] 1200 break; 1201 } +p okcol $1 = <optimized out>
Note: even after disabling OMP in the CI tests, issue #845 has not shown up. The only 6 usual errors are from #826 and #872. I think that I understand this better now: this issue ONLY happens for AVX512 (and specifically 512z with FPTYPE=f). Our CI on standard github nodes does not offer AVX512 (and it would be an overkill to set up our own nodes just for that). So I will debug this in manual tests. |
…adgraph5#845) by adding 'volatile', again This no longer crashes gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \ -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit
… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again
…FPE crashes madgraph5#845 in 512z/f (and the option to hardcode away OMP)
…g of known issues: the CI will succeed Revert "[fpe] in .github/workflows/testsuite_oneprocess.sh, disable bypassing of known issues: the CI will fail" This reverts commit a5b7c42.
Hi @oliviermattelaer yet another one (we are getting there...). I have fixed this intermittent crash #845 again by using 'volatile'. This time it should be less controversial I hope. The crash happens deep in SIMD code and only in specific configurations like 512z and fptype=f. So I am quite confident that disabling optimizations with volatile makes sense. Note this interesting post on SIMD and float, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993 En passant I added the option (that you had asked for in #758) to disable OpenMP. You just need to export OMPFLAGS= (set an empty string: this is different from unsetting it). Can you please review? Thanks! |
STARTED AT Sat Jun 29 01:07:28 PM CEST 2024 ./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean ENDED(1) AT Sat Jun 29 02:40:30 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean ENDED(2) AT Sat Jun 29 02:55:58 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean ENDED(3) AT Sat Jun 29 03:03:51 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst ENDED(4) AT Sat Jun 29 03:06:32 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst ENDED(5) AT Sat Jun 29 03:09:12 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common ENDED(6) AT Sat Jun 29 03:11:56 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean ENDED(7) AT Sat Jun 29 03:23:34 PM CEST 2024 [Status=0]
…ected (failures in heft madgraph5#833 and susy madgraph5#826 - but intermittent gqttq madgraph5#845 is fixed) Note two points: - gqttq madgraph5#845 was intermittent, so the fact that it has disappeared could be casual: but I actually think it is fixed - the tmad CI also shows pptt012j madgraph5#872, but I am not running pptt012j tests in the tmad suite yet STARTED AT Sat Jun 29 03:23:34 PM CEST 2024 (SM tests) ENDED(1) AT Sat Jun 29 07:44:46 PM CEST 2024 [Status=0] (BSM tests) ENDED(1) AT Sat Jun 29 07:54:26 PM CEST 2024 [Status=0] 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
I have rerun the manual tests. This is now ready to be merged (but it would be best to first agree on and merge mg5amcnlo/mg5amcnlo#115 and #873) |
… merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…er merging git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)
…adgraph5#852 and color2 PR madgraph5#877, instead of color PR madgraph5#873) into fpe
…easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
… and madgraph5#881 fixes for madgraph5#856) into fpe
… to be sure - no change in code (two matrix1.pdf changed but content looks the same?)
Hi @oliviermattelaer I have updated this again with upstream/master, it is ready to be merged
Can you please review? (when you have time, not urgent) Thanks |
Perfect to be merged (but I would like to keep #758 open, since I think that we should have an agreement on how we handle OpenMP by default from the madgraph5. (i.e. should I set OMPFLAGS= by default or not, should we have an entry in the run_card to control that /...) -> will copy-paste this message in #758 (since this is the proper place to discuss that). Thanks a lot, Olivier |
Very good, thanks Olivier! I will merge this then, and reopen #758 Note, there are 6 errors in the CI but they are the usual failing ones |
…from fpe PR madgraph5#874 and omp PR madgraph5#900 for OpenMP issue madgraph5#758 - bldall and runTest succeds
This is a WIP PR to allow no-OpenMP builds (#758) and to debug intermittent FPEs (#845).
This is starting from MR #873 which fixes color mappings and related issues.
The reason why OMP and FPEs are related is that I had some indication that the FPE crash is intermittent with OMP, but happens all the time without OMP. So disabling OMP might trigger the issue more often.
See f1e0d42