API (9a) Templated FFV functions (memory access for wavefunctions/amplitudes) - move to cuda 11.6 (and test icx2021/2022 and gcc112) #328
Conversation
I am halfway through this PR, but I am having a big issue with build times. Moving to templated FFV functions has increased the compilation time enormously: it now takes 75 minutes (1h15min) to build gCPPProcess.o through nvcc for ggttgg. (I can do this in parallel for the various none/simd builds, but still.) The eemumu and ggtt processes are ok, but ggttgg is just unbearable. Using top, I can see that 'cicc' is taking most of the CPU time. This is clearly related to issues discussed in other places:
Note also that
I will have a quick detour on #51 and reassess RDC costs...
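The build-time blow-up is consistent with the cost of template instantiation: if the memory-access class becomes a template parameter of every FFV function, the compiler must instantiate each function body once per access class (multiplied by processes, helicity combinations, simd variants, ...). A minimal sketch of the pattern, with all names hypothetical (this is not the actual madgraph4gpu code):

```cpp
#include <cassert>
#include <complex>

using cxtype = std::complex<double>; // stand-in for the project's complex type

// Hypothetical memory-access policy: how wavefunctions are laid out in a buffer.
struct AccessAoS {
  static cxtype& wf( cxtype* buffer, int ievt, int icomp ) {
    constexpr int ncomp = 6; // components per wavefunction
    return buffer[ievt * ncomp + icomp]; // array-of-structs: event-major
  }
};

struct AccessSoA {
  static cxtype& wf( cxtype* buffer, int ievt, int icomp ) {
    constexpr int nevt = 4; // illustrative "page" size
    return buffer[icomp * nevt + ievt]; // struct-of-arrays: component-major
  }
};

// Templated FFV-like function: the access class is a compile-time parameter,
// so the compiler instantiates the full body once per access class, which is
// one reason templated FFV functions can multiply nvcc/cicc compile time.
template<class Access>
cxtype FFV_dot( cxtype* w1, cxtype* w2, int ievt ) {
  cxtype out = 0;
  for ( int ic = 0; ic < 6; ic++ )
    out += Access::wf( w1, ievt, ic ) * Access::wf( w2, ievt, ic );
  return out;
}
```

The upside of the pattern is that the same FFV body serves any memory layout with no runtime dispatch; the downside is exactly the instantiation cost described above.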
Ok, I have reached a temporary conclusion here: finally, move to cuda 11.5!
At this stage, all looks good:
Now this PR needs to be completed with several other points (even before moving to split kernels)
…PEV_BRK - prepare to always include cxtype_ref
…MGONGPU_HAS_CPPCXTYPEV_BRK), move it to cx header
…HelAmps functions - all ok
… prepare to generalise to all XXX
…tions to all XXX - all ok, same performance
…yAccessRandomNumbers
…yAccessMatrixElements
…rom that for random numbers)
…ersion with reinterpret_cast<cxtype_sv*>
…XX functions - all ok, same perf
…/cuda116 at this point in time (Prepare to experiment with other compilers too)
…cc102 on ggttgg/avx2/512y, slower elsewhere!
Revert "[apiwf] test all three processes with clang12/cuda116 - faster than gcc102 on ggttgg/avx2/512y, slower elsewhere!" This reverts commit 8a7bb2d.
…sting performance, different physics? (NB even cuda116 only supports clang13, cannot use this yet through cuda)
Revert "[apiwf] test 3 procs with icx2022(based on clang14) + gcc102 - interesting performance, different physics?" This reverts commit 8d04e49.
… fully in SubProcess stuff)
…reference to `_intel_fast_memcpy'
…cal-compare warnings
… regenerate ggttgg auto
…/Makefile and mgOnGpuCxtypes.h changes
Revert "[apiwf] rerun 3 procs with icx2021(clang13+gcc102)/cuda161 - very interesting results (madgraph5#220)" This reverts commit 6350a75.
…aybe a tiny bit better
…102/cuda116 (revert) Revert "[apiwf] rerun 3 procs with gcc112/cuda161 - very similar to gcc102, maybe a tiny bit better" This reverts commit fcdf1b6. (There are several things to sort out in a future apiwf2, but I prefer to merge this now - and move to cuda116)
I am going to merge this PR even if the work it contains is logically incomplete.
The above was about the limitations. What this PR does contain, however, is
With respect to the points I mentioned at the top, some to-do items and next steps are below. To clean up:
A new "first" step would then be
The second step (only as a test case to study, clearly not a production version) would remain
The third step (probably to be done together with the second) would be
I am also merging this now because there are a couple of orthogonal issues which are higher priority.
All checks have now succeeded. I will merge
This is the WIP beginning of a PR for step9 of #323, namely adding memory buffers and memory access classes for wavefunctions and amplitudes. The motivation is to allow splitting the cuda sigmaKin kernel into many smaller kernels for the XXX and FFV functions (also being prototyped in #242), and possibly a separate kernel for the color algebra (#155 and #118).
Note that the motivation here is exclusively cuda/device, but I am defining an interface that also applies to c++. The idea is also to clean up and simplify the MemoryAccess classes, which work well but are not documented and may be overkill (it may be easier to inline some boilerplate without going too far into variadic template overcomplications).
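One way to "inline the boilerplate" without variadic templates is a thin accessor that simply reinterprets a raw buffer as the complex type, in the spirit of the reinterpret_cast<cxtype_sv*> version tried in the commits. A purely illustrative sketch (class and constant names are hypothetical, not the existing MemoryAccess API):

```cpp
#include <cassert>
#include <complex>

using fptype = double;
using cxtype = std::complex<fptype>;

// Hypothetical flat accessor for a wavefunction buffer: each event owns
// nw6 = 6 complex components, stored contiguously as pairs of fptype.
// (std::complex guarantees this array-compatible layout, so the
// reinterpret_cast below is well-defined.)
class WavefunctionAccess {
public:
  static constexpr int nw6 = 6; // complex components per wavefunction
  // Return a pointer to the start of one event's wavefunction: the
  // boilerplate is inlined here instead of hidden behind variadic templates.
  static cxtype* kernelAccess( fptype* buffer, int ievt ) {
    return reinterpret_cast<cxtype*>( buffer ) + ievt * nw6;
  }
};
```

The resulting call site, w[ic] on the returned pointer, stays as readable as the current interface while keeping the class small enough to document in a few lines.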
This is WIP. For the moment I have just
A lot of things remain to be done. In random order, the first step:
The first step above (quite long) should provide the same performance, with the same single-event access as now. That is to say, the interfaces would change a lot, but the algorithm would remain the same and the performance should stay the same.
The second step then is
The third step is then