(DO NOT MERGE: for reference only) Further tests of Olivier's "fix_826" branch PR 852 #870

Closed
wants to merge 19 commits into from


valassi
Member

@valassi valassi commented Jun 26, 2024

This contains further tests of Olivier's "fix_826" branch PR #852. It does not look good. I will discuss the results in PR #852 directly.

oliviermattelaer and others added 13 commits May 31, 2024 16:47
…ot channelId (and note that iconfig=1 is ok)
… test a different iconfig

In particular: the following triggers a SIGFPE reported in madgraph5#855 (crash in rotxxx that can be fixed adding volatile?)
  ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean

This also triggers a similar SIGFPE (initially reported in madgraph5#826)
  ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#852

The gg_ttgg test still crashes (rotxxx madgraph5#855?)
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fce5ec23860 in ???
   1  0x7fce5ec22a05 in ???
   2  0x7fce5e854def in ???
   3  0x44b5ff in ???
   4  0x4087df in ???
   5  0x409848 in ???
   6  0x40bb83 in ???
   7  0x40d1a9 in ???
   8  0x45c804 in ???
   9  0x434269 in ???
   10  0x40371e in ???
   11  0x7fce5e83feaf in ???
   12  0x7fce5e83ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3913008 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The susy_gg_t1t1 test also still crashes (see madgraph5#826?), this looks like the same crash as ggttgg above
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f9f03423860 in ???
   1  0x7f9f03422a05 in ???
   2  0x7f9f03054def in ???
   3  0x43809f in ???
   4  0x40581f in ???
   5  0x4067b1 in ???
   6  0x408c71 in ???
   7  0x40a0a9 in ???
   8  0x444fdf in ???
   9  0x42bb38 in ???
   10  0x40371e in ???
   11  0x7f9f0303feaf in ???
   12  0x7f9f0303ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3907179 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The gqttq test also still crashes intermittently, i.e. only on the second execution (madgraph5#845?)
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fbafa623860 in ???
   1  0x7fbafa622a05 in ???
   2  0x7fbafa254def in ???
   3  0x7fbafad24034 in ???
   4  0x7fbafa9a1575 in ???
   5  0x7fbafad20c89 in ???
   6  0x7fbafad2abfd in ???
   7  0x7fbafad30491 in ???
   8  0x43008b in ???
   9  0x431c10 in ???
   10  0x432d47 in ???
   11  0x433b1e in ???
   12  0x44a921 in ???
   13  0x42ebbf in ???
   14  0x40371e in ???
   15  0x7fbafa23feaf in ???
   16  0x7fbafa23ff5f in ???
   17  0x403844 in ???
   18  0xffffffffffffffff in ???
  ./madX.sh: line 387: 3922797 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' failed
…nd cudacpp.mk to improve the crash dumps

The susyggt1t1 test clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb7e1223860 in ???
   1  0x7fb7e1222a05 in ???
   2  0x7fb7e0e54def in ???
   3  0x43809f in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x40581f in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1480
   5  0x4067b1 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1167
   6  0x408c71 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:68
   7  0x40a0a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:60
   8  0x444fdf in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:172
   9  0x42bb38 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301
  ./tmad/madX.sh: line 387: 3928626 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_susyggt1t1_x1_cudacpp > /tmp/avalassi/output_susyggt1t1_x1_cudacpp' failed

The ggttgg test also clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb141c23860 in ???
   1  0x7fb141c22a05 in ???
   2  0x7fb141854def in ???
   3  0x44b5ff in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x4087df in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
   5  0x409848 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
   6  0x40bb83 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
   7  0x40d1a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
   8  0x45c804 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
   9  0x434269 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
  ./tmad/madX.sh: line 387: 3933302 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed

The gqttq test instead clearly crashes in sigmaKin (madgraph5#845):
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f607ee23860 in ???
   1  0x7f607ee22a05 in ???
   2  0x7f607ea54def in ???
   3  0x7f607f607008 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1190
   4  0x7f607f4ab575 in ???
   5  0x7f607f603c89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1093
   6  0x7f607f60dbfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
   7  0x7f607f613491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
   8  0x7f607f613491 in fbridgesequence_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
   9  0x43008b in smatrix1_multi_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
   10  0x431c10 in dsig1_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
   11  0x432d47 in dsigproc_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
   12  0x433b1e in dsig_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
   13  0x44a921 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
   14  0x42ebbf in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
   15  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
  ./madX.sh: line 387: 3941122 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed

Conclusion: I would not merge 852 as it does not fix issues yet.
Instead I would merge 857 to fix the rotxxx crash 855 using volatile, and reassess from there...
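The symbolized backtraces above each locate the crash at the first frame whose symbol is resolved (`rotxxx_` for susy_gg_t1t1 and gg_ttgg, `sigmaKin` for gq_ttq). As a minimal sketch of how that frame could be extracted from such a dump automatically, the following Python helper parses the frame lines and the `at file:line` lines that follow them. The function name `first_resolved_frame` and the regular expressions are illustrative assumptions, not part of madX.sh or of the repository; the sample text is copied from the susy_gg_t1t1 backtrace above.

```python
import re

# Frame line, e.g. "   3  0x43809f in rotxxx_" (hypothetical pattern for this log format)
FRAME_RE = re.compile(r"^\s*(\d+)\s+(0x[0-9a-fA-F]+) in (\S+)\s*$")
# Location line that may follow a frame, e.g. "          at /path/to/file.f:1247"
AT_RE = re.compile(r"^\s*at (.+):(\d+)\s*$")


def first_resolved_frame(backtrace_text):
    """Return (index, symbol, source_file, line) for the first frame whose
    symbol is not '???', or None if no frame is resolved."""
    lines = backtrace_text.splitlines()
    for i, line in enumerate(lines):
        m = FRAME_RE.match(line)
        if not m or m.group(3) == "???":
            continue
        source, lineno = None, None
        # The source location, when present, is on the line right after the frame
        if i + 1 < len(lines):
            at = AT_RE.match(lines[i + 1])
            if at:
                source, lineno = at.group(1), int(at.group(2))
        return int(m.group(1)), m.group(3), source, lineno
    return None


SAMPLE = """\
   0  0x7fb7e1223860 in ???
   1  0x7fb7e1222a05 in ???
   2  0x7fb7e0e54def in ???
   3  0x43809f in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x40581f in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1480
"""

frame = first_resolved_frame(SAMPLE)
print(frame[1], frame[2].rsplit("/", 1)[-1], frame[3])  # rotxxx_ aloha_functions.f 1247
```

On the unsymbolized dumps earlier in this PR every frame is `???`, so the helper returns `None`, which is consistent with why the build was changed to improve the crash dumps.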
@valassi valassi self-assigned this Jun 26, 2024
@valassi valassi marked this pull request as draft June 26, 2024 13:41
Fix conflicts in MG5aMC/mg5amcnlo (keep the latest gpucpp_826 version including the recent gpucpp changes)
@valassi
Member Author

valassi commented Jun 27, 2024

I merged the latest upstream/master into this, including the CI tmad tests.

This now has 24 failures, which should be (I did not check each one of them):

…test upstream/master (after cherry-picking the madX.sh changes too)

GITMB=$(git merge-base --fork-point upstream/master HEAD)
echo $GITMB
  a87e640
git checkout $GITMB $(git ls-tree --name-only $GITMB */CODEGEN*txt)
Fix conflicts in MG5aMC/mg5amcnlo (keep the latest gpucpp_826 version including the recent gpucpp changes)
@valassi
Member Author

valassi commented Jun 27, 2024

FWIW I upgraded this to the latest upstream/master. But again this will not be merged.

Note: on upstream/master (before adding these extra changes in PR 870) there were 9 errors in the CI, i.e. three (one per fptype) for each of the following three issues, see #857 (comment):

  • no cross section (826) in susy_gg_t1t1
  • LHE color mismatch (856) in gg_ttgg
  • xsec discrepancy (872) in pp_tt012j

With these extra patches in 870, there are 22 CI failures in https://github.com/madgraph5/madgraph4gpu/actions/runs/9704073463 . These seem to be the following:

  • 3x no cross section (826) in susy_gg_t1t1
  • 3x LHE color mismatch (856) in gg_ttgg
  • 3x (new?) LHE color mismatch in pp_tt012j
  • 10 build failures for fptype=m
  • (plus 3x GPU failures in the old CI tests)

So maybe these patches changed something in the color mapping, but actually broke it further in pp_tt012j? I am not sure; in any case, I will investigate more.
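As a quick consistency check of the counts quoted in this comment, the per-category numbers can be tallied as below. This is only a sketch: the dictionary keys are shorthand labels for the bullet items above, and the counts are copied from the comment rather than read from the CI run itself.

```python
# Baseline on upstream/master: three issues, one failure per fptype (three fptypes)
baseline = {
    "no cross section (826) in susy_gg_t1t1": 3,
    "LHE color mismatch (856) in gg_ttgg": 3,
    "xsec discrepancy (872) in pp_tt012j": 3,
}

# With the extra patches in PR 870, as listed in the comment
with_patches = {
    "no cross section (826) in susy_gg_t1t1": 3,
    "LHE color mismatch (856) in gg_ttgg": 3,
    "LHE color mismatch in pp_tt012j": 3,
    "build failures for fptype=m": 10,
    "GPU failures in the old CI tests": 3,
}

baseline_total = sum(baseline.values())
patched_total = sum(with_patches.values())
print(baseline_total, patched_total)  # 9 22
```

The categories add up to the 9 and 22 failures quoted above, so the breakdown accounts for every reported failure.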

@valassi
Member Author

valassi commented Jun 28, 2024

I am closing this. This is superseded by MR #873, providing several fixes in this area.
