Conversation


@valassi valassi commented Jan 12, 2022

This is the WIP beginning of a PR for step 9 of #323, namely adding memory buffers and memory access classes for wavefunctions and amplitudes. The motivation is to allow splitting the cuda sigmakin kernel into many smaller kernels for the XXX and FFV functions (also being prototyped in #242), and possibly a separate kernel for the color algebra (#155 and #118).

Note that the motivation here is exclusively cuda/device, but I am defining an interface that also applies to c++. The idea is also to clean up and simplify the MemoryAccess classes, which work well but are not documented and may be overkill (it may be easier to inline some boilerplate rather than go too far into variadic template overcomplications).

This is WIP. For the moment I have just

  • added a proof of concept of using fptype* instead of cxtype_sv* for wavefunctions in the XXX interfaces, and then using reinterpret casts
  • added the cxtype_ref class unconditionally, because this may be another way to move from RI pairs to complex types... it really depends which memory structure we want in CUDA, but if we keep R and I always contiguous, then a reinterpret cast may be a much simpler solution (and I can undo/revert the changes adding cxtype_ref); see the sketch after this list
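
To make this concrete, here is a minimal sketch (hypothetical names, not the actual classes in this PR) of the two alternatives: reinterpreting a contiguous RI buffer as complex numbers, versus a cxtype_ref-style proxy over two separate real/imaginary references.

```cpp
#include <complex>

using fptype = double;               // assumption: double-precision build
using cxtype = std::complex<fptype>;

// (1) If the real/imaginary (RI) pairs of a wavefunction are contiguous in memory,
// a plain fptype* buffer can simply be reinterpreted as an array of complex numbers.
inline cxtype* asComplex( fptype* wf /* 6 complex components = 12 fptype */ )
{
  return reinterpret_cast<cxtype*>( wf ); // valid only for contiguous RI pairs
}

// (2) If R and I live in two separate locations, a reference-like proxy (the idea
// behind cxtype_ref) can stand in for a complex lvalue without changing the layout.
class cxtype_ref_sketch
{
public:
  cxtype_ref_sketch( fptype& r, fptype& i ) : m_r( r ), m_i( i ) {}
  cxtype_ref_sketch& operator=( const cxtype& c ) { m_r = c.real(); m_i = c.imag(); return *this; }
  operator cxtype() const { return cxtype( m_r, m_i ); }
private:
  fptype& m_r;
  fptype& m_i;
};
```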

A lot of things remain to be done. In random order, the first step includes:

  • add a second template parameter to the XXX functions, for the wavefunction access class
  • add the wavefunction memory access class (initially just a reinterpret cast for a single event - this will always be the case for C++, modulo the neppV vectorization handled by cxtype_v)
  • use the memory buffer class for wavefunctions
  • apply the same trick of moving from cxtype_sv to fptype also to the FFV functions
  • similarly, add a template parameter for wavefunction memory access in the FFV functions
  • add memory buffers and memory access classes for amplitudes too
  • integrate the new memory buffers/access for amplitudes in the FFV functions and the calling sequence
  • add a second template parameter to the FFV functions for amplitude access (a sketch of this templated interface follows the list)
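
As a rough illustration of the last bullets (hypothetical names like W_ACCESS, A_ACCESS and HostAccess*, not the code in this PR), the FFV functions would take the memory access classes as template parameters, so that the algorithm only ever sees complex pointers while the layout decoding is delegated:

```cpp
#include <complex>

#ifdef __CUDACC__
#define HOSTDEV __host__ __device__ // hypothetical helper macro for this sketch
#else
#define HOSTDEV
#endif

using fptype = double;
using cxtype = std::complex<fptype>;

// Hypothetical access class: on C++ (one event at a time) decoding a wavefunction
// from a plain fptype buffer is just a reinterpret cast; a CUDA specialization
// could instead compute an AOSOA index from the thread id.
struct HostAccessWf
{
  static HOSTDEV cxtype* kernelAccess( fptype* buffer )
  {
    return reinterpret_cast<cxtype*>( buffer );
  }
};

// Hypothetical access class for amplitudes (same trivial decoding for now).
struct HostAccessAmp
{
  static HOSTDEV cxtype* kernelAccess( fptype* buffer )
  {
    return reinterpret_cast<cxtype*>( buffer );
  }
};

// Sketch of an FFV-like function with two extra template parameters: the memory
// layout is decided entirely by W_ACCESS and A_ACCESS, not by the algorithm.
template<class W_ACCESS, class A_ACCESS>
HOSTDEV void FFV_sketch( fptype* allF1, fptype* allF2, fptype* allV3, fptype* allAmp )
{
  const cxtype* F1 = W_ACCESS::kernelAccess( allF1 );
  const cxtype* F2 = W_ACCESS::kernelAccess( allF2 );
  const cxtype* V3 = W_ACCESS::kernelAccess( allV3 );
  cxtype* amp = A_ACCESS::kernelAccess( allAmp );
  amp[0] = F1[2] * F2[4] * V3[2]; // dummy contraction, not the real Feynman rule
}
```

A call site would then look like FFV_sketch<HostAccessWf, HostAccessAmp>( f1, f2, v3, amps ).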

The first step above (quite long) should give the same performance, with single-event access as now. That is to say, the interfaces would change a lot, but the algorithm would remain the same and the performance should stay the same.

The second step then is

  • move to multi-event buffers (possibly AOSOA) for wavefunctions and amplitudes too, only in cuda (an indexing sketch follows this list)
  • this will generate a lot more traffic with global memory, and it is likely to decrease performance if taken alone; in any case, do not yet attempt to split the kernels
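
To make the second step concrete, here is a minimal sketch of how an AOSOA wavefunction buffer in global memory could be indexed (the layout and constants are assumptions, not the final choice):

```cpp
using fptype = double;

// Assumed AOSOA layout for a multi-event wavefunction buffer in global memory:
// [npagW][nw6][nx2][neppV], i.e. pages of neppV events, nw6 complex components
// per wavefunction, and real/imaginary parts (nx2) stored in separate planes.
constexpr int nw6 = 6;   // complex components per wavefunction
constexpr int nx2 = 2;   // real and imaginary parts
constexpr int neppV = 4; // events per page (assumption)

inline fptype& wfAccess( fptype* buffer, int ievt, int iw6, int ix2 )
{
  const int ipagV = ievt / neppV; // page index
  const int ieppV = ievt % neppV; // event index within the page
  return buffer[ieppV + neppV * ( ix2 + nx2 * ( iw6 + nw6 * ipagV ) )];
}
```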

The third step is then

  • split the kernels and check if it gets any better...

@valassi valassi self-assigned this Jan 12, 2022
@valassi valassi marked this pull request as draft January 12, 2022 08:48

valassi commented Jan 17, 2022

I am halfway through this PR, but I am having a big issue with build times. Moving to templated FFV functions has increased the compilation time enormously: it now takes 75 minutes (1h15min) to build gCPPProcess.o through nvcc for ggttgg. (I can build the various none/simd variants in parallel, but still.) The eemumu and ggtt processes are ok, but ggttgg is just unbearable. According to top, 'cicc' is taking most of the CPU time.

This is clearly related to issues discussed in other places.

Note also that

  • these templated FFVs are really only needed to make the code more elegant, in a way: if the HelAmps are in src, I do not want the MemoryAccess functions from P1 to be in src as they do not belong there... this could easily be hacked away somehow
  • note also that when we do split the kernels, I assume that this long build time will disappear, because by definition we will have many smaller kernels, easier to optimize, rather than one huge kernel with cuda link time optimization throughout

I will take a quick detour on #51 and reassess RDC costs...


valassi commented Jan 17, 2022

Ok, I have reached a temporary conclusion here: finally, move to cuda 11.5!

  • I have done some tests of RDC (#51, "More precise assessment of RDC cost"); I will rebase on a merge of those tests. There is indeed a penalty from RDC, even if only at the 20% level or so (not the factor 10 we had seen long ago): RDC is a no-go, especially with a separate HelAmps
  • I also did various tests with inlining some bits and not others, but this is no good as a solution
  • out of lack of ideas, I tried to see if cuda 11.5 builds the code faster: it does BOTH build the code in a reasonable time AND give faster runtime performance! The performance regression from cuda 11.0 to cuda 11.4 (#282) disappears... note however that this regression disappears ONLY if I also move to templated code in this apiwf PR. So essentially I must do the two together: move to templated methods AND move to cuda 11.5

At this stage, all looks good:

  • I am able to move to templated FFV functions in this PR
  • I am able to move to cuda 11.5 (I actually must do that)
  • and out of these two things I even get a 2%+ performance increase (5.1 instead of 5.0 max throughput on ggttgg cuda)

Now this PR needs to be completed with several other points (even before moving to split kernels).

@valassi valassi changed the title WIP - API (9a) Memory buffers and access for wavefunctions and amplitudes WIP - API (9a) Memory buffers/access for wavefunctions/amplitudes - and move to cuda 11.5 Jan 17, 2022
@valassi valassi changed the title WIP - API (9a) Memory buffers/access for wavefunctions/amplitudes (templated FFV functions) - and move to cuda 11.5 API (9a) Templated FFV functions (memory access for wavefunctions/amplitudes) - move to cuda 11.6 (and test icx2021/2022 and gcc112) Jan 20, 2022

valassi commented Jan 20, 2022

I am going to merge this PR even if the work it contains is logically incomplete.

  • There are memory buffers for wavefunctions and amplitudes, but they are not used/tested (and a hack reinterpret cast is used in CPPProcess.cc between w_sv and w_fp): this is a bit halfway between an old and a new way of doing these things
  • The memory access classes by themselves are essentially dummy. And cxtype_ref has been added unconditionally, when maybe it should not be there.
  • All these things will become clearer only when moving towards a strategy for splitting kernels.

The above was about the limitations. What this PR does contain, however, is the templated FFV functions (with memory access classes for wavefunctions/amplitudes), the move to cuda 11.6, and the tests of icx2021/2022 and gcc112.

With respect to what I mentioned at the top, some to-do items and next steps are below.

To clean up

  • Clean up the business of fptype* vs cxtype_sv* for wavefunctions and amplitudes in the FFVs. Maybe hide the cxtype_ref class again if not needed.
  • Clean up how memory buffers are assigned for wavefunctions and amplitudes in CPPProcess. The clean way will probably be to move this OUTSIDE calculate_wavefunctions, into the kernel launcher class calling calculate_wavefunctions (see the sketch after this list). This demands quite some structural changes, which are in any case needed for splitting kernels. In any case, those reinterpret casts between w_sv and w_fp should disappear.
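
The kind of structural change meant in the second bullet could look roughly like the following sketch (assumed names, error checking omitted): the launcher owns the device buffers and the kernel only receives pointers.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

using fptype = double;

// Minimal RAII device buffer (sketch only, no error checking).
class DeviceBufferSketch
{
public:
  explicit DeviceBufferSketch( std::size_t nfptypes )
  {
    cudaMalloc( (void**)&m_data, nfptypes * sizeof( fptype ) );
  }
  ~DeviceBufferSketch() { cudaFree( m_data ); }
  fptype* data() { return m_data; }
private:
  fptype* m_data = nullptr;
};

// Hypothetical launcher: buffers for all wavefunctions and amplitudes are allocated
// here, OUTSIDE calculate_wavefunctions, and passed down to the kernel(s).
void launchSigmaKinSketch( int nevt, int nwf, int namp )
{
  DeviceBufferSketch wfs( (std::size_t)nevt * nwf * 6 * 2 ); // 6 complex components each
  DeviceBufferSketch amps( (std::size_t)nevt * namp * 2 );   // 1 complex amplitude each
  // sigmaKin<<<nblocks, nthreads>>>( ..., wfs.data(), amps.data() ); // launch would go here
}
```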

A new "first" step would then be

  • With the buffers outside calculate_wavefunctions, do not yet change the structure to global-memory AOSOAs. Each buffer would still hold a single event. This should give the same performance as now.

The second step (only as a test case to study, clearly not a production version) would remain

  • move to multi-event buffers (possibly AOSOA) for wavefunctions and amplitudes too, only in cuda
  • this will generate a lot more traffic with global memory, and it is likely to decrease performance if taken alone; in any case, do not yet attempt to split the kernels

The third step (probably to be done together with the second) would be

  • split the kernels and check if it gets any better...
  • I would actually start by putting the jamps in global memory, rather than the amps... and maybe I would start by just splitting sigmakin into two kernels, one for the XXX+FFVs and one for the color matrix algebra (a sketch of this split follows the list)
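
A very rough sketch of that two-kernel split, with the jamps in a global-memory buffer between the two kernels (names, sizes and payloads are all placeholders):

```cpp
#include <cuda_runtime.h>

using fptype = double;
constexpr int ncolor = 24; // number of color flows (placeholder value)

// Kernel 1: XXX wavefunctions + FFV vertices, writing one set of jamps per event.
__global__ void sigmaKin_feynman_sketch( const fptype* allMomenta, fptype* allJamps )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  for( int icol = 0; icol < ncolor; icol++ )
  {
    // dummy payload: the real kernel would accumulate jamp[icol] from the diagrams
    allJamps[( ievt * ncolor + icol ) * 2] = allMomenta[ievt]; // real part
    allJamps[( ievt * ncolor + icol ) * 2 + 1] = 0;            // imaginary part
  }
}

// Kernel 2: color matrix algebra, reading the jamps back from global memory.
__global__ void sigmaKin_color_sketch( const fptype* allJamps, fptype* allMEs )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  fptype me = 0;
  for( int icol = 0; icol < ncolor; icol++ )
  {
    const fptype re = allJamps[( ievt * ncolor + icol ) * 2];
    const fptype im = allJamps[( ievt * ncolor + icol ) * 2 + 1];
    me += re * re + im * im; // dummy: the real kernel would apply the color matrix
  }
  allMEs[ievt] = me;
}
```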

I am also merging this now because there are a couple of orthogonal issues which have higher priority.

All checks have now succeeded. I will merge.
