Jorgen's addition of HIP/CUDA abstraction to codegen and all processes (with extra fixes and tests: "jt774" replacing PR #774) #801
Conversation
…rc.mk and cudacpp.mk
…se HIP else neither" in CODEGEN cudacpp.mk
Fix conflicts:
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/check_sa.cc
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/cudacpp.mk
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/cudacpp_src.mk
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/mgOnGpuConfig.h
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/output.py
…ce % by %% (code generation was failing)
…EN (code generation was failing clang formatting checks)
CUDA_HOME=none HIP_HOME=none make |& more
...
ccache g++ -O3 -std=c++17 -I. -fPIC -Wall -Wshadow -Wextra -ffast-math -fopenmp -march=skylake-avx512 -mprefer-vector-width=256 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HAS_NO_CURAND -fPIC -c Parameters_sm.cc -o Parameters_sm.o
In file included from /usr/include/c++/11/locale:41,
                 from /usr/include/c++/11/iomanip:43,
                 from Parameters_sm.cc:17:
/usr/include/c++/11/bits/locale_facets_nonio.h:59:39: error: ‘locale’ has not been declared
   59 |     struct __timepunct_cache : public locale::facet
      |                                       ^~~~~~
This reverts commit def02b5.
…ors.h and process_matrix.inc as in branch jthip24. These are changes that in that branch I included in commit 6e90139 (Tue Jul 18 18:25:34 2023 +0200), which consisted of a backport to CODEGEN of earlier changes in ggttggg.mad.
…GONGPUCPP_GPUIMPL... not clear why this was not done yet. In branch jthip24, this is coming from Jorgen's commit 6741186 (Thu Jul 13 15:15:41 2023 +0200), which includes many such changes.
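For readers not familiar with this renaming, here is a minimal sketch (illustrative only, not code from this PR) of the kind of change being backported: CUDA-compiler-specific guards such as __CUDACC__ are replaced by a single MGONGPUCPP_GPUIMPL macro that can cover both the CUDA and the HIP builds. The kernel name below is hypothetical.

```cpp
// Illustrative sketch only (not code from this PR): replace CUDA-compiler
// guards by a single GPU-implementation guard that also covers HIP builds.
//
// Before: device code only compiled when nvcc defines __CUDACC__
//   #ifdef __CUDACC__
//   __global__ void someKernel();
//   #endif
//
// After: device code compiled whenever a GPU backend is selected
#ifdef MGONGPUCPP_GPUIMPL
__global__ void someKernelSketch() {} // built by nvcc or hipcc (hypothetical name)
#endif

int main() { return 0; } // placeholder so the sketch also compiles as plain C++
```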
…ggttgg.mad on Tue Jul 18 18:11:04 2023 +0200)
Fix conflicts:
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/cpp_model_parameters_h.inc
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/check_sa.cc
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/mgOnGpuConfig.h
NB: this is very strange, because this same commit 1b5c0fd is already included in the jt774 branch earlier on...
So for me this simple madgraph test is not working anymore:
The crash is:
Does it work for you, Andrea?
I was using gpucpp_wrap; indeed gpucpp is working. Let me check with that one whether SIMD is now working on my Mac.
Hi @oliviermattelaer thanks, yes I am using the gpucpp branch. Or more precisely: I am using whatever mg5amcnlo commit we have in the module, which is a rather old version of gpucpp. If you want, I can also try to include the upgrade to the latest gpucpp here. But please first check with 23f61b9... again, I'd avoid mixing everything together if possible (it's ok if it works, but if it does not work I would rather debug the issues separately). Thanks!
PS: 23f61b9 is also what we have now in master.
I cannot test with that MG5aMC version since my Python version is too recent for that "old" version of MG5aMC...
Ok, it does not work for me actually.
Note for later: we will need to re-run the test on that commit, since the reported crash is a GitHub API issue (i.e. not related to the commit per se).
…ted before Olivier's commit]
STARTED AT Tue Jan 30 01:27:55 AM CET 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Tue Jan 30 05:12:08 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Tue Jan 30 05:41:46 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Tue Jan 30 05:52:05 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Tue Jan 30 05:55:35 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Tue Jan 30 05:59:00 AM CET 2024 [Status=0]
…ted before Olivier's commit]
STARTED AT Tue Jan 30 06:02:30 AM CET 2024
ENDED AT Tue Jan 30 10:37:48 AM CET 2024
Status=0
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
Hi @oliviermattelaer thanks for testing and for adding the fix 9ed3aaf. As you saw, I opened PR #811 in parallel about the update of mg5amcnlo to the latest gpucpp. But I suggest we postpone that and use the older gpucpp for this PR #801. And yes, we do too much for GitHub and its CI ;-)
….py changes, while c++/cuda/hip is unchanged
I have checked the plugin/interface part and this is good to merge for me. However, this would need adding a new backend within launch_plugin.py. I would say that our target here would be to be able to run via the following script:
(maybe a good time to change the name of that variable)
Let me change the flag to approve, since my "requested changes" are not strictly necessary for us to merge this (in that case, I will create a new branch that will implement such changes and ask you to check them).
…qttq (madgraph5#806)
(1) Step 1 - build on the login node (almost 24 hours!)
STARTED AT Tue 30 Jan 2024 02:27:18 AM EET
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -makeonly
ENDED(1) AT Wed 31 Jan 2024 12:32:21 AM EET [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -makeonly
ENDED(2) AT Wed 31 Jan 2024 01:01:06 AM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean -makeonly
ENDED(3) AT Wed 31 Jan 2024 01:13:56 AM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst -makeonly
ENDED(4) AT Wed 31 Jan 2024 01:16:06 AM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rorhst -makeonly
ENDED(5) AT Wed 31 Jan 2024 01:18:44 AM EET [Status=0]
(2) Step 2 - run tests on the worker node (less than 2 hours)
NB this is "./tput/allTees.sh" WITHOUT the -hip flag (no "-rorhst" added)
STARTED AT Wed 31 Jan 2024 01:16:39 PM EET
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Wed 31 Jan 2024 02:09:05 PM EET [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Wed 31 Jan 2024 02:26:12 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Wed 31 Jan 2024 02:45:10 PM EET [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Wed 31 Jan 2024 02:48:54 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Wed 31 Jan 2024 02:51:15 PM EET [Status=0]
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0_bridge.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0_bridge.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd1.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd1.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd1.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd1.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:ERROR! Fortran calculation (F77/CUDA) crashed
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Backtrace for this error:
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:ERROR! Fortran calculation (F77/CUDA) crashed
…qttq (madgraph5#806)
NB this is "./tmad/allTees.sh" WITHOUT the -hip flag (no "-rorhst" added)
STARTED AT Wed 31 Jan 2024 02:54:59 PM EET
ENDED AT Wed 31 Jan 2024 06:02:10 PM EET
Status=0
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
…d90 before merging to master
Thanks Olivier, I will merge. I committed a few more manual test logs, all OK. And the CI is good. After this 801 is merged, I would suggest going in the following order:
Also, later on,
And then, later, all the other non-HIP stuff (warp size, more on makefiles, etc).
And thanks @Jooorgen again! :-)
…are to merge upstream/master with HIP madgraph5#801
git checkout 0dc3d50~ $(git ls-tree --name-only HEAD */CODEGEN*txt)
… from PR madgraph5#801) into gpucpp
…the mg5amcnlo update: no changes except in codegen logs (changes in individual processes have been merged already)
…raph5#801 and gpucpp PR madgraph5#811) into rocrand
Fix conflicts here (plus some in gg_tt.mad fixed by checking out the rocrand version):
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/check_sa.cc
epochX/cudacpp/tput/allTees.sh
epochX/cudacpp/tput/throughputX.sh
…raph5#801 and gpucpp PR madgraph5#811, and possibly more) into mch
…nd maybe more)
** rerun 18 tmad tests on itscrd90, all ok
STARTED AT Sat Feb 3 07:02:02 PM CET 2024
ENDED AT Sat Feb 3 11:20:09 PM CET 2024
Status=0
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
…dgraph5#801 and gpucpp PR madgraph5#811) into makefiles
Fix conflicts in epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/cudacpp.mk
Hi @oliviermattelaer @roiser @nscottnichols, as discussed, this is the WIP PR replacing Jorgen's PR #774.
It includes all of Jorgen's changes for HIP in PR #774 (GpuAbstraction.h and hipcc build instructions) in CODEGEN, plus a merge of the current upstream/master and extra fixes that were necessary to fix codegen and/or CUDA/C++ builds. Those extra fixes were partly derived from earlier work I had done with Jorgen in July/August (kept for reference in PR #800, which I opened and immediately closed).
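For context, the sketch below shows the general shape of a CUDA/HIP abstraction header of this kind: a single set of gpu-prefixed aliases that resolve to either the CUDA or the HIP runtime API depending on the compiler in use. This is only an illustration under that assumption; the macro names and selection logic in the actual GpuAbstraction.h may differ.

```cpp
// Minimal sketch of a CUDA/HIP abstraction header (illustrative only; the
// actual GpuAbstraction.h in this PR may use different names and logic).
// Backend-agnostic process code uses the gpu* aliases, and the compiler
// decides whether they map to the CUDA or the HIP runtime API.
#if defined( __HIPCC__ )
#include <hip/hip_runtime.h>
#define gpuError_t hipError_t
#define gpuSuccess hipSuccess
#define gpuMalloc hipMalloc
#define gpuFree hipFree
#define gpuMemcpy hipMemcpy
#define gpuMemcpyHostToDevice hipMemcpyHostToDevice
#define gpuMemcpyDeviceToHost hipMemcpyDeviceToHost
#define gpuDeviceSynchronize hipDeviceSynchronize
#elif defined( __CUDACC__ )
#include <cuda_runtime.h>
#define gpuError_t cudaError_t
#define gpuSuccess cudaSuccess
#define gpuMalloc cudaMalloc
#define gpuFree cudaFree
#define gpuMemcpy cudaMemcpy
#define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
#define gpuMemcpyDeviceToHost cudaMemcpyDeviceToHost
#define gpuDeviceSynchronize cudaDeviceSynchronize
#endif
// Example usage in backend-agnostic code (only compiled for a GPU backend):
//   double* devBuf = nullptr;
//   gpuError_t err = gpuMalloc( (void**)&devBuf, nBytes );
//   if( err == gpuSuccess ) gpuFree( devBuf );
```

In this sketch the makefile (cudacpp.mk) is assumed to do nothing more than select nvcc or hipcc, with the header absorbing the remaining runtime-API differences.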
This PR has a fully functioning CODEGEN and regenerated processes, with basic builds tested for ggtt.mad, but it is still WIP.
I will close PR #774 because it is replaced by this one.