Conversation


@valassi valassi commented Jan 12, 2022

This is the WIP beginning of a PR for step 9 of #323, namely adding memory buffers and memory access classes for wavefunctions and amplitudes. The motivation is to allow splitting the cuda sigmakin kernel into many smaller kernels for the XXX and FFV functions (also being prototyped in #242), and possibly a separate kernel for the color algebra (#155 and #118).

Note that the motivation here is exclusively cuda/device, but I am defining an interface that also applies to c++. The idea is also to clean up and simplify the MemoryAccess classes, which work well but are not documented and may be overkill (it may be easier to inline some boilerplate rather than go too far into variadic template overcomplications).

This is WIP. For the moment I have just

  • added a proof of concept of using fptype* instead of cxtype_sv* for wavefunctions in the XXX interfaces, and then using reinterpret casts
  • added the cxtype_ref class unconditionally, because this may be another way to move from RI pairs to complex types... it really depends which memory structure we want in CUDA, but if we keep R and I always contiguous, then a reinterpret cast may be a much simpler solution (and I can undo/revert the changes adding cxtype_ref); see the sketch after this list
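
To make this concrete, here is a minimal sketch (hypothetical names, not the actual classes in this PR) of the two alternatives: reinterpreting a contiguous RI buffer as complex numbers, versus a cxtype_ref-style proxy over two separate real/imaginary references.

```cpp
#include <complex>

using fptype = double;               // assumption: double-precision build
using cxtype = std::complex<fptype>;

// (1) If the real/imaginary (RI) pairs of a wavefunction are contiguous in memory,
// a plain fptype* buffer can simply be reinterpreted as an array of complex numbers.
inline cxtype* asComplex( fptype* wf /* 6 complex components = 12 fptype */ )
{
  return reinterpret_cast<cxtype*>( wf ); // valid only for contiguous RI pairs
}

// (2) If R and I live in two separate locations, a reference-like proxy (the idea
// behind cxtype_ref) can stand in for a complex lvalue without changing the layout.
class cxtype_ref_sketch
{
public:
  cxtype_ref_sketch( fptype& r, fptype& i ) : m_r( r ), m_i( i ) {}
  cxtype_ref_sketch& operator=( const cxtype& c ) { m_r = c.real(); m_i = c.imag(); return *this; }
  operator cxtype() const { return cxtype( m_r, m_i ); }
private:
  fptype& m_r;
  fptype& m_i;
};
```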

A lot of things remain to be done. In random order, the first step includes:

  • add a second template parameter to the XXX functions, for the wavefunction access class
  • add the wavefunction memory access class (initially just a reinterpret cast for a single event - this will always be the case for C++, modulo the neppV vectorization handled by cxtype_v)
  • use the memory buffer class for wavefunctions
  • apply the same trick of moving from cxtype_sv to fptype also to the FFV functions
  • similarly, add a template parameter for wavefunction memory access in the FFV functions
  • add memory buffers and memory access classes for amplitudes too
  • integrate the new memory buffers/access for amplitudes in the FFV functions and the calling sequence
  • add a second template parameter to the FFV functions for amplitude access (a sketch of this templated interface follows the list)
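
As a rough illustration of the last bullets (hypothetical names like W_ACCESS, A_ACCESS and HostAccess*, not the code in this PR), the FFV functions would take the memory access classes as template parameters, so that the algorithm only ever sees complex pointers while the layout decoding is delegated:

```cpp
#include <complex>

#ifdef __CUDACC__
#define HOSTDEV __host__ __device__ // hypothetical helper macro for this sketch
#else
#define HOSTDEV
#endif

using fptype = double;
using cxtype = std::complex<fptype>;

// Hypothetical access class: on C++ (one event at a time) decoding a wavefunction
// from a plain fptype buffer is just a reinterpret cast; a CUDA specialization
// could instead compute an AOSOA index from the thread id.
struct HostAccessWf
{
  static HOSTDEV cxtype* kernelAccess( fptype* buffer )
  {
    return reinterpret_cast<cxtype*>( buffer );
  }
};

// Hypothetical access class for amplitudes (same trivial decoding for now).
struct HostAccessAmp
{
  static HOSTDEV cxtype* kernelAccess( fptype* buffer )
  {
    return reinterpret_cast<cxtype*>( buffer );
  }
};

// Sketch of an FFV-like function with two extra template parameters: the memory
// layout is decided entirely by W_ACCESS and A_ACCESS, not by the algorithm.
template<class W_ACCESS, class A_ACCESS>
HOSTDEV void FFV_sketch( fptype* allF1, fptype* allF2, fptype* allV3, fptype* allAmp )
{
  const cxtype* F1 = W_ACCESS::kernelAccess( allF1 );
  const cxtype* F2 = W_ACCESS::kernelAccess( allF2 );
  const cxtype* V3 = W_ACCESS::kernelAccess( allV3 );
  cxtype* amp = A_ACCESS::kernelAccess( allAmp );
  amp[0] = F1[2] * F2[4] * V3[2]; // dummy contraction, not the real Feynman rule
}
```

A call site would then look like FFV_sketch<HostAccessWf, HostAccessAmp>( f1, f2, v3, amps ).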

The first step above (quite long) should give the same performance, with single-event access as now. That is to say, the interfaces would change a lot, but the algorithm would remain the same and the performance should stay the same.

The second step then is

  • move to multi-event buffers (possibly AOSOA) for wavefunctions and amplitudes too, only in cuda (an indexing sketch follows this list)
  • this will generate a lot more traffic with global memory, and it is likely to decrease performance if taken alone; in any case, do not yet attempt to split the kernels
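
To make the second step concrete, here is a minimal sketch of how an AOSOA wavefunction buffer in global memory could be indexed (the layout and constants are assumptions, not the final choice):

```cpp
using fptype = double;

// Assumed AOSOA layout for a multi-event wavefunction buffer in global memory:
// [npagW][nw6][nx2][neppV], i.e. pages of neppV events, nw6 complex components
// per wavefunction, and real/imaginary parts (nx2) stored in separate planes.
constexpr int nw6 = 6;   // complex components per wavefunction
constexpr int nx2 = 2;   // real and imaginary parts
constexpr int neppV = 4; // events per page (assumption)

inline fptype& wfAccess( fptype* buffer, int ievt, int iw6, int ix2 )
{
  const int ipagV = ievt / neppV; // page index
  const int ieppV = ievt % neppV; // event index within the page
  return buffer[ieppV + neppV * ( ix2 + nx2 * ( iw6 + nw6 * ipagV ) )];
}
```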

The third step is then

  • split the kernels and check if it gets any better...

@valassi valassi self-assigned this Jan 12, 2022
@valassi valassi marked this pull request as draft January 12, 2022 08:48

valassi commented Jan 17, 2022

I am halfway through this PR, but I am having a big issue with build times. Moving to templated FFV functions has increased the compilation time enormously: it now takes 75 minutes (1h15min) to build gCPPProcess.o through nvcc for ggttgg. (I can build the various none/simd variants in parallel, but still.) The eemumu and ggtt processes are ok, but ggttgg is just unbearable. According to top, 'cicc' is taking most of the CPU time.

This is clearly related to issues discussed in other places.

Note also that

  • these templated FFVs are really only needed to make the code more elegant, in a way: if the HelAmps are in src, I do not want the MemoryAccess functions from P1 to be in src as they do not belong there... this could easily be hacked away somehow
  • note also that when we do split the kernels, I assume that this long build time will disappear, because by definition we will have many smaller kernels, easier to optimize, rather than one huge kernel with cuda link time optimization throughout

I will take a quick detour on #51 and reassess RDC costs...


valassi commented Jan 17, 2022

Ok, I have reached a temporary conclusion here: finally, move to cuda 11.5!

  • I have done some tests of RDC (#51, "More precise assessment of RDC cost"); I will rebase on a merge of those tests. There is indeed a penalty from RDC, even if only at the 20% level or so (not the factor 10 we had seen long ago): RDC is a no-go, especially with a separate HelAmps
  • I also did various tests with inlining some bits and not others, but this is no good as a solution
  • out of lack of ideas, I tried to see if cuda 11.5 builds the code faster: it does BOTH build the code in a reasonable time AND give faster runtime performance! The performance regression from cuda 11.0 to cuda 11.4 (#282) disappears... note however that this regression disappears ONLY if I also move to templated code in this apiwf PR. So essentially I must do the two together: move to templated methods AND move to cuda 11.5

At this stage, all looks good:

  • I am able to move to templated FFV functions in this PR
  • I am able to move to cuda 11.5 (I actually must do that)
  • and out of these two things I even get a 2%+ performance increase (5.1 instead of 5.0 max throughput on ggttgg cuda)

Now this PR needs to be completed with several other points (even before moving to split kernels).

@valassi valassi changed the title WIP - API (9a) Memory buffers and access for wavefunctions and amplitudes WIP - API (9a) Memory buffers/access for wavefunctions/amplitudes - and move to cuda 11.5 Jan 17, 2022
@valassi valassi changed the title WIP - API (9a) Memory buffers/access for wavefunctions/amplitudes (templated FFV functions) - and move to cuda 11.5 API (9a) Templated FFV functions (memory access for wavefunctions/amplitudes) - move to cuda 11.6 (and test icx2021/2022 and gcc112) Jan 20, 2022

valassi commented Jan 20, 2022

I am going to merge this PR even if the work it contains is logically incomplete.

  • There are memory buffers for wavefunctions and amplitudes, but they are not used/tested (and a hack reinterpret cast is used in CPPProcess.cc between w_sv and w_fp): this is a bit halfway between an old and a new way of doing these things
  • The memory access classes by themselves are essentially dummy. And cxtype_ref has been added unconditionally, when maybe it should not be there.
  • All these things will become clearer only when moving towards a strategy for splitting kernels.

The above was about the limitations. What this PR does contain, however, is the templated FFV functions (with memory access classes for wavefunctions/amplitudes), the move to cuda 11.6, and the tests of icx2021/2022 and gcc112.

With respect to what I mentioned at the top, some to-do items and next steps are below.

To clean up

  • Clean up the business of fptype* vs cxtype_sv* for wavefunctions and amplitudes in the FFVs. Maybe hide the cxtype_ref class again if not needed.
  • Clean up how memory buffers are assigned for wavefunctions and amplitudes in CPPProcess. The clean way will probably be to move this OUTSIDE calculate_wavefunctions, into the kernel launcher class calling calculate_wavefunctions (see the sketch after this list). This demands quite some structural changes, which are in any case needed for splitting kernels. In any case, those reinterpret casts between w_sv and w_fp should disappear.
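
The kind of structural change meant in the second bullet could look roughly like the following sketch (assumed names, error checking omitted): the launcher owns the device buffers and the kernel only receives pointers.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

using fptype = double;

// Minimal RAII device buffer (sketch only, no error checking).
class DeviceBufferSketch
{
public:
  explicit DeviceBufferSketch( std::size_t nfptypes )
  {
    cudaMalloc( (void**)&m_data, nfptypes * sizeof( fptype ) );
  }
  ~DeviceBufferSketch() { cudaFree( m_data ); }
  fptype* data() { return m_data; }
private:
  fptype* m_data = nullptr;
};

// Hypothetical launcher: buffers for all wavefunctions and amplitudes are allocated
// here, OUTSIDE calculate_wavefunctions, and passed down to the kernel(s).
void launchSigmaKinSketch( int nevt, int nwf, int namp )
{
  DeviceBufferSketch wfs( (std::size_t)nevt * nwf * 6 * 2 ); // 6 complex components each
  DeviceBufferSketch amps( (std::size_t)nevt * namp * 2 );   // 1 complex amplitude each
  // sigmaKin<<<nblocks, nthreads>>>( ..., wfs.data(), amps.data() ); // launch would go here
}
```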

A new "first" step would then be

  • With the buffers outside calculate_wavefunctions, do not yet change the structure to global-memory AOSOAs. Each buffer would still hold a single event. This should give the same performance as now.

The second step (only as a test case to study, clearly not a production version) would remain

  • move to multi-event buffers (possibly AOSOA) for wavefunctions and amplitudes too, only in cuda
  • this will generate a lot more traffic with global memory, and it is likely to decrease performance if taken alone; in any case, do not yet attempt to split the kernels

The third step (probably to be done together with the second) would be

  • split the kernels and check if it gets any better...
  • I would actually start by putting the jamps in global memory, rather than the amps... and maybe I would start by just splitting sigmakin into two kernels, one for the XXX+FFVs and one for the color matrix algebra (a sketch of this split follows the list)
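
A very rough sketch of that two-kernel split, with the jamps in a global-memory buffer between the two kernels (names, sizes and payloads are all placeholders):

```cpp
#include <cuda_runtime.h>

using fptype = double;
constexpr int ncolor = 24; // number of color flows (placeholder value)

// Kernel 1: XXX wavefunctions + FFV vertices, writing one set of jamps per event.
__global__ void sigmaKin_feynman_sketch( const fptype* allMomenta, fptype* allJamps )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  for( int icol = 0; icol < ncolor; icol++ )
  {
    // dummy payload: the real kernel would accumulate jamp[icol] from the diagrams
    allJamps[( ievt * ncolor + icol ) * 2] = allMomenta[ievt]; // real part
    allJamps[( ievt * ncolor + icol ) * 2 + 1] = 0;            // imaginary part
  }
}

// Kernel 2: color matrix algebra, reading the jamps back from global memory.
__global__ void sigmaKin_color_sketch( const fptype* allJamps, fptype* allMEs )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  fptype me = 0;
  for( int icol = 0; icol < ncolor; icol++ )
  {
    const fptype re = allJamps[( ievt * ncolor + icol ) * 2];
    const fptype im = allJamps[( ievt * ncolor + icol ) * 2 + 1];
    me += re * re + im * im; // dummy: the real kernel would apply the color matrix
  }
  allMEs[ievt] = me;
}
```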

I am also merging this now because there are a couple of orthogonal issues which have higher priority.

All checks have now succeeded. I will merge.
