
Final HIP Platform implementation for AMD GPUs on ROCm #3338

Merged
merged 9 commits into openmm:master on Jul 22, 2022

Conversation

@AJcodes
Contributor

AJcodes commented Nov 18, 2021

Hi Dr. Eastman and co.,

This MR continues the work that was done here: we made sure that the code is up to date with the latest release of the OpenMM master branch and supports all the same functionality as the current CUDA platform (including all of the CUDA-specific plugins, such as amoeba).

Additionally, further optimisations have been made in this MR to ensure the best possible performance on AMD GPUs. These include load balancing, the use of shuffle operations, and other changes specific to the HIP platform.

As for testing, we focused on two sets of tests: the unit tests within the repository and the tests from openmmtools. On our side, the unit tests within the repository all pass. I've attached a log from the openmmtools tests, which we ran on an MI100 (64 tests pass and 4 fail, which is the same result we got when running the tests on an RTX 3090).

2021-11-10-openmmtools-0-mi100-fixed-amoeba.log

Given that the scope of this PR is extensive, I'd like to discuss the next steps (e.g. how we can help to ease the reviewing process).

@peastman
Member

Thanks for your work on this! Let's discuss some options for moving forward.

We had a lot of discussion on the first PR. See especially my comments at #2736 (comment). Merging this into the main OpenMM code base isn't going to happen in the foreseeable future. It's a huge amount of new code (over 43,000 lines added) which I would be accepting responsibility for maintaining, and I simply don't have the bandwidth to do it. Given the limited number of our users with AMD GPUs, and the fact that we already support them with OpenCL, I just can't justify adding it.

If this is something you're willing to maintain long term, a much better option would be to keep it in its own repository as a separate plugin. If we can make it conda installable, it will be very easy for users to get, and it will automatically be detected and used at runtime. We'll be happy to work with you in getting all of that set up.

@giadefa
Member

giadefa commented Nov 18, 2021 via email

@jchodera
Member

Does the forthcoming AMD MI250 change the equation any?

@AJcodes
Contributor Author

AJcodes commented Nov 24, 2021

If this is something you're willing to maintain long term, a much better option would be to keep it in its own repository as a separate plugin. If we can make it conda installable, it will be very easy for users to get, and it will automatically be detected and used at runtime. We'll be happy to work with you in getting all of that set up.

@peastman Thanks for your reply, perhaps we could get started with making this conda installable. What are the next steps for this?

@AJcodes
Contributor Author

AJcodes commented Nov 24, 2021

Does the forthcoming AMD MI250 change the equation any?

@jchodera The optimisations in this MR do take the MI250 into account.

@peastman
Member

Great! Let's do that.

The first thing you'll want to do is bring this up to date with the current OpenMM code. It looks like it branched off a while ago, and github reports conflicts that would prevent merging it. So merge the latest changes from the master branch, fix conflicts, and get it working again.

The next step is to move it into its own repository and allow it to be built independently. Mostly that should be straightforward. It's already an independent plugin, so you probably won't need to make any code changes. But the CMake scripts will require some reworking to separate it out. You might want to look at the example plugin, which can serve as an example for doing that.

Once that is done, the final step is to create a feedstock on conda-forge for packaging it. We can help with that once you get to that point.

@AJcodes
Contributor Author

AJcodes commented Dec 2, 2021

Great! Let's do that.

The first thing you'll want to do is bring this up to date with the current OpenMM code. It looks like it branched off a while ago, and github reports conflicts that would prevent merging it. So merge the latest changes from the master branch, fix conflicts, and get it working again.

The next step is to move it into its own repository and allow it to be built independently. Mostly that should be straightforward. It's already an independent plugin, so you probably won't need to make any code changes. But the CMake scripts will require some reworking to separate it out. You might want to look at the example plugin, which can serve as an example for doing that.

Once that is done, the final step is to create a feedstock on conda-forge for packaging it. We can help with that once you get to that point.

@peastman I've resolved the merge conflict, so we're ready for the next step. By the way, would it be possible to host this code as a branch on the official OpenMM repository instead of a separate repository? We'd like to have the code be accessible to all OpenMM users without having it in another repository.

@peastman
Member

peastman commented Dec 2, 2021

Since you'll be maintaining it, it would be best to have it in a repository you own. That repository won't include all of OpenMM. It will include only the code for the new plugin. This is similar to how we handle other plugins that are distributed separately, like https://github.com/openmm/openmm-torch and https://github.com/openmm/openmm-plumed/. Those repositories can also serve as useful examples.

@AJcodes
Contributor Author

AJcodes commented Dec 3, 2021

@peastman One last question on the topic: would it be possible to have a repository in the OpenMM GitHub organization (e.g. openmm/openmm-hip)?

@peastman
Member

peastman commented Dec 3, 2021

I'd need to discuss it with all the PIs. Putting it in the openmm organization could imply it's code we maintain and support.

None of this will matter to users, of course. They'll just type

conda install -c conda-forge openmm-hip

@jchodera
Member

jchodera commented Dec 3, 2021

@AJcodes : Huge thanks for all the effort in getting this fully working!

A few questions that will help us out when the OpenMM team meets soon:

  • Have you folks run the full OpenMM benchmark suite on any hardware? Especially interesting would be a comparison of OpenCL and HIP.
  • If we did bring this into the OpenMM GitHub org as a separate repo for the plugin, what kind of commitment to help maintain the code could be provided? If the commitment was limited, could we have a warning that this is experimental?
  • Is there any interest in having the HIP platform also available on the OpenMM core deployed across Folding@home, or is the primary interest in achieving top performance on high-end datacenter-grade GPUs?

If it would be easier to discuss some of these topics in a call, just reach out by email and I can help set something up.

@muziqaz

muziqaz commented Dec 11, 2021

I also would like to thank @AJcodes for the incredible work done on this. On top of that, I would like to add a few arguments for adding HIP support to OpenMM.

While OpenCL support is there for AMD GPUs, the performance is really poor, to put it politely. Moreover, AMD seem to have started to distance themselves from OpenCL support in the past year. We are seeing up to 50% performance degradation in Folding@home (and other distributed computing projects) going from driver version 21.3.2 (Windows) to anything after that. Even with the 21.3.2 drivers AMD is far behind NVIDIA's CUDA, so add another 50% drop in performance and we have no choice but to exclude all AMD hardware from future F@H projects. AMD seem to be completely oblivious to our cries about it, so I figure they are now fully concentrating on HIP instead of OpenCL.

Having HIP support in OpenMM would save the existing base of AMD users at F@H (and other distributed computing projects), and it would not put off new users, of which there are plenty, since AMD is very competitive across the board nowadays. We are seeing incredible inroads by AMD in various GPU markets, which hasn't happened for a very long time, if ever. It would be beneficial to the OpenMM project to have HIP support; otherwise we end up looking stagnant with regard to AMD hardware by settling for OpenCL.

I know AMD themselves are a big part of the issue here.

If there is a need for AMD hardware to run tests on, it should be possible to make arrangements for remote access to a Vega 64, Radeon VII, 5700 XT and 6900 XT (Windows only, though).

Regards

@xCaradhras

No updates in almost 3 months... is this initiative dead?

@ex-rzr
Contributor

ex-rzr commented Feb 7, 2022

No updates in almost 3 months... is this initiative dead?

No, it's not dead. You can check it here: https://github.com/StreamHPC/openmm-hip

We discussed this with the OpenMM team: the new approach is to split the HIP-related changes into two parts: the HIP backend as a plugin, and some changes in OpenMM that are required for HIP (in common kernels, etc.).

I'll merge the recent changes to https://github.com/StreamHPC/openmm and resolve conflicts, then we'll update this PR (or perhaps create a new one).

@AJcodes
Contributor Author

AJcodes commented Feb 8, 2022

@peastman @jchodera As discussed, @ex-rzr has completed the split. This PR now contains the changes in OpenMM (in the common kernels, etc.) that are required for HIP, and is ready for review.

@peastman
Member

peastman commented Feb 8, 2022

Thanks! I'll start going through it and making comments. It may take me a little while--there's a lot to look through!

Could you comment on your changes to the benchmarking script? Some of them are obvious changes to support HIP, but others seemed to be adding new features, and it wasn't clear what the reason for them was.

@@ -57,6 +57,20 @@ struct mm_int4 {
    mm_int4(int x, int y, int z, int w) : x(x), y(y), z(z), w(w) {
    }
};
struct mm_long2 {
    long x, y;
Member

You want this to be long long. See https://en.cppreference.com/w/cpp/language/types. In many compilers, long is only 32 bits.
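As a minimal, self-contained sketch of the suggested fix (the constructors here are assumed by analogy with the mm_int4 struct shown above):

struct mm_long2 {
    // long long is guaranteed to be at least 64 bits, whereas long can be
    // only 32 bits with some compilers (e.g. LLP64 on 64-bit Windows).
    long long x, y;
    mm_long2() {
    }
    mm_long2(long long x, long long y) : x(x), y(y) {
    }
};
static_assert(sizeof(long long) >= 8, "long long is at least 64 bits");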

@peastman
Member

peastman commented Feb 8, 2022

One note on naming. We use the word "warp" to refer to a block of 32 threads, even when that doesn't match the SIMD width of the processor. It's kind of an abuse of notation, but we use it pretty widely. For example, the SYNC_WARPS macro ensures that groups of 32 threads are synchronized. On Intel GPUs where the SIMD width is only 16, that means synchronizing the entire block. It's a bit more than strictly required, and possibly hurts performance a bit. But not very much, and anyway, we aren't too concerned about getting the best possible performance on very low end GPUs.
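As a purely hypothetical illustration of that convention (these are not OpenMM's actual macro definitions), SYNC_WARPS could be defined per backend roughly like this:

// Hypothetical sketch: SYNC_WARPS guarantees that each group of 32 threads
// (a "warp" in OpenMM's usage) is synchronized, whatever the hardware width.
#ifdef __CUDACC__
    // On NVIDIA hardware a warp really is 32 threads, so a warp-level
    // synchronization is sufficient.
    #define SYNC_WARPS __syncwarp();
#else
    // On hardware whose SIMD width is below 32 (e.g. 16 on some Intel GPUs),
    // synchronizing the whole work-group is a simple, slightly conservative
    // way to keep every 32-thread group in step.
    #define SYNC_WARPS barrier(CLK_LOCAL_MEM_FENCE);
#endif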

for (int group = GROUP_ID; group < numParticleGroups; group += NUM_GROUPS) {
// The threads in this block work together to compute the center one group.

int firstIndex = groupOffsets[group];
int lastIndex = groupOffsets[group+1];
real3 center = make_real3(0);
for (int index = LOCAL_ID; index < lastIndex-firstIndex; index += LOCAL_SIZE) {
for (int index = LOCAL_ID; index < lastIndex-firstIndex; index += BLOCK_SIZE) {
Member

How is BLOCK_SIZE different from LOCAL_SIZE?

In other kernels where we need to define a macro for the thread block size, we call it THREAD_BLOCK_SIZE to be clear.

Comment on lines 1668 to 1669
replacements["BLOCK_SIZE"] = cc.intToString(this->blockSize);
replacements["WARP_SIZE"] = cc.intToString(cc.getSIMDWidth());
Member

Following on my earlier comments, "warp size" is always defined as 32, regardless of the hardware. Be aware that getSIMDWidth() is not entirely reliable, since there's no standard mechanism for determining it in OpenCL. If we aren't sure what the width is, that method returns 1.
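For illustration only, a hedged sketch of host code following that convention (useShuffle is an invented name for the example; replacements and cc are as in the diff above):

// Always define the "warp" size as 32 by convention; treat the reported SIMD
// width only as an optimization hint, since getSIMDWidth() returns 1 when the
// width cannot be determined.
replacements["WARP_SIZE"] = cc.intToString(32);
bool useShuffle = (cc.getSIMDWidth() >= 32);  // enable width-dependent paths only when the width is known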

@gunnarre

gunnarre commented Feb 8, 2022

We discussed this with the OpenMM team: the new approach is to split the HIP-related changes into two parts: the HIP backend as a plugin, and some changes in OpenMM that are required for HIP (in common kernels, etc.).

Has it been decided to integrate HIP support into the Folding at Home GPU core too?

void CommonCalcCustomNonbondedForceKernel::initInteractionGroups(const CustomNonbondedForce& force, const string& interactionSource, const vector<string>& tableTypes) {
// Process groups to form tiles.


const int tileSize = cc.getTileSize();
Member

The changes in this class are based on the assumption that the tile size used for computing interaction groups is the same as the size used for standard nonbonded interactions. They both happen to be 32, but there's no reason they need to be the same. They're computed with different code using different neighbor lists. I think it's best to leave it as it is, with the size just hardcoded to 32. Some day we might (or might not) decide it would be useful to make that configurable, but either way it shouldn't be required to match the main nonbonded kernel.

@@ -3011,7 +3048,8 @@ void CommonCalcCustomGBForceKernel::initialize(const System& system, const Custo
pairValueDefines["NUM_ATOMS"] = cc.intToString(cc.getNumAtoms());
pairValueDefines["PADDED_NUM_ATOMS"] = cc.intToString(cc.getPaddedNumAtoms());
pairValueDefines["NUM_BLOCKS"] = cc.intToString(numAtomBlocks);
pairValueDefines["TILE_SIZE"] = "32";
pairValueDefines["TILE_SIZE"] = cc.intToString(tileSize);
pairValueDefines["tileflags"] = (tileSize > 32 ? "unsigned long" : "unsigned int");
Member

That should be mm_ulong rather than unsigned long. long is 32 bits in CUDA, 64 bits in OpenCL.

Member

And likewise in similar code below.
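For concreteness, a sketch of what that substitution would look like on the line quoted above (illustrative, not the final committed code):

// Emit OpenMM's mm_ulong type, which is 64 bits on every backend, instead of
// "unsigned long", whose width differs between CUDA and OpenCL.
pairValueDefines["tileflags"] = (tileSize > 32 ? "mm_ulong" : "unsigned int");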

@@ -4390,17 +4430,19 @@ CommonCalcCustomManyParticleForceKernel::~CommonCalcCustomManyParticleForceKerne

void CommonCalcCustomManyParticleForceKernel::initialize(const System& system, const CustomManyParticleForce& force) {
ContextSelector selector(cc);
const int tileSize = cc.getTileSize();
Member

This is another class that's computed with completely different code from standard nonbonded interactions. It doesn't need to use the same tile size they do.

// Create data structures used for the neighbor list.

int numAtomBlocks = (numRealParticles+31)/32;
const int tileSize = cc.getTileSize();
Member

Another one that shouldn't depend on the tile size used by the main nonbonded kernel.


// First loop: process tiles that contain exclusions.
#if !defined(USE_HIP)
Member

Can you explain what this change is about? I don't understand it.

@@ -12,7 +12,12 @@ KERNEL void findAtomGridIndex(GLOBAL const real4* RESTRICT posq, GLOBAL int2* RE
) {
// Compute the index of the grid point each atom is associated with.

#if !defined(USE_FLAT_KERNELS)
Member

Can you explain what this is about?

Comment on lines 23 to 24
// Easier to cope with varying block / wavefront sizes w/o perf. penalty if
// expressed as a constexpr reduction
Member

Keep in mind that CM motion removal takes truly negligible time. It's not worth adding any complexity to optimize it, because there will be no practical benefit.

It allows building and installing the common platform files even if
the CUDA or OpenCL platforms are not built.
This is required for the HIP platform (openmm-hip) if the ROCm OpenCL
packages are not installed.
OPENMM_PYTHON_USER_INSTALL is OFF by default.
The HIP platform supports multiple FFT backends; this commit moves
findLegalFFTDimension to ComputeContext, so platforms can have their own
implementations.
@ex-rzr
Contributor

ex-rzr commented Jul 18, 2022

@peastman, could you check the current version?
I removed everything related to tile sizes and some other less important code.

There are still some changes that are not mandatory:

  • 70dab08 - a more convenient way to install python packages for the current user only (we use it when multiple developers need to work on OpenMM without conflicts)
  • f34057e - the comment says what the change does. I added it after spending an hour trying to understand why a changed kernel worked incorrectly. It turned out that during development I had copied some code containing multi-line replacement tokens like SHUFFLE_WARP_DATA and commented one copy out with //, which in the final generated code left only the first line commented. This check catches such situations (see the sketch below).
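To illustrate the pitfall described in f34057e (the expansion shown here is invented for the example): the kernels are assembled by plain text substitution, so a multi-line token hidden behind a // comment is only partly disabled.

// Source as written, apparently disabled:
//     // SHUFFLE_WARP_DATA
//
// After the host code substitutes the token with its multi-line expansion,
// only the first expanded line remains commented out:
//     // tempForce = SHFL(tempForce, srcLane);   <- commented
//     tempPos = SHFL(tempPos, srcLane);          // <- still compiled and executed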

We think it's OK to remove these changes if you don't like them; just say the word. I can also split them into separate PRs (though they are very small).

Thanks!

@peastman
Member

Thanks! I'll take a look. Could you also remove the changes to benchmark.py for the moment, just because we have another open PR that rewrites large parts of that script? Once that's merged we can make any other changes to it in a separate PR.

skyreflectedinmirrors and others added 3 commits July 19, 2022 10:37
The generated code is not optimal; for example, the compiler generates
flat_load instructions instead of ds_read.
Force the compiler to use all registers for gridSpreadCharge and
gridInterpolateForce by limiting max waves per EU to 1 on CDNA GPUs;
RDNA GPUs work better without it.
@peastman
Member

peastman left a comment

This looks great. If we can avoid duplicating all the AtomData struct definitions, that would be preferable. Otherwise, I think it's ready to merge.

Comment on lines 3 to 5
#if defined(USE_HIP)

typedef struct alignas(16) {
Member

Since the structures are identical other than alignment, what about writing it like this?

#if defined(USE_HIP)
    #define ALIGN alignas(16)
#else
    #define ALIGN
#endif

typedef struct ALIGN {

That way we don't repeat the structure definition, which risks someone changing it in one place but not the other.

Contributor

You are right! Done.

Regarding the other structs: I can use the same approach with defines to hide the differences (alignment, padding), but they also have a different order of fields. I'm worried about rearranging fields for all platforms because I'm not sure it won't affect performance. I can profile the GBSA and Amoeba benchmarks on a few Nvidia GPUs with CUDA (3090, Titan V, 2060), but that definitely can't cover all devices where OpenMM is used.
In my opinion, it seems safer to keep separate definitions. Also, it's likely that we'll be able to remove at least some of this commit in the future because (as far as I know) the developers of the HIP compiler continue to work on improving code generation for shared memory accesses.

What do you think?

@peastman
Member

So far as I know, the order of fields in a shared memory struct should never affect performance on an NVIDIA GPU. It's not like global memory, where the order could affect cache performance. If you test a few GPUs and confirm it doesn't change performance, that's good enough for me.

Manually rearrange fields, add padding and force alignment to
get faster accesses to shared memory: ds_read and ds_write may
be slower if addresses are not 16-byte aligned.
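As a rough illustration of the kind of layout change described here (the struct fields are invented for the example; real is OpenMM's float/double typedef):

// Hypothetical sketch: alignas(16) plus explicit padding keeps each element
// stored in shared (LDS) memory on a 16-byte boundary, so the compiler can
// use full-width ds_read/ds_write instructions.
typedef struct alignas(16) {
    real x, y, z, q;    // position and charge
    real fx, fy, fz;    // accumulated force
    real padding;       // explicit padding keeps sizeof a multiple of 16 bytes
} AtomData;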
@ex-rzr
Contributor

ex-rzr commented Jul 22, 2022

Ok, I've combined the structure definitions and profiled gbsa (computeBornSum, computeGBSAForce1), amoebagk and amoebapme (computeGKForces, computeInducedField, computeEDiffForce, computeElectrostatics) on those 3 devices: I don't see any performance drops caused by the field reordering.

@peastman
Member

That looks good. Which means it's ready to merge!

peastman merged commit a39fa14 into openmm:master on Jul 22, 2022
@DanielWicz

DanielWicz commented Jan 31, 2023

In which version of OpenMM is HIP expected to be usable? Is 8.0 already using it, or only OpenCL/CUDA?

@ex-rzr
Contributor

ex-rzr commented Jan 31, 2023

@DanielWicz

In which version of OpenMM is HIP expected to be usable?

This PR contains the changes in common files that we needed for HIP support.
The HIP platform is a plugin, which you can find here: https://github.com/amd/openmm-hip

There are instructions on how to build it from source or install a package from conda (I've noticed that OpenMM 8.0.0 has been released; I'll need to rebuild the conda package for this version).

If you have any questions or problems, please open an issue in that repository (and ping me, just to be sure).

@egallicc
Contributor

We are testing 8.0.0rc1 with openmm-hip built from the sources downloaded from https://github.com/StreamHPC/openmm-hip. All tests are passing without changes.

@egallicc
Contributor

egallicc commented Feb 1, 2023

All openmm-8.0.0rc1 + openmm-hip tests passed except TestHipFFTImplHipFFT{Single,Mixed,Double} because the executable could not be found.

@ex-rzr
Contributor

ex-rzr commented Feb 1, 2023

Thanks, @egallicc

You're using RDNA GPUs, correct? What ROCm version, OS? (I'm just gathering statistics, as it's impossible to test everywhere)

TestHipFFTImplHipFFT{Single,Mixed,Double} because the executable could not be found

That's interesting. These tests had failures on older ROCm versions, but they shouldn't fail because of missing executables. Can you post the log?

@egallicc
Contributor

egallicc commented Feb 5, 2023

Yes, @ex-rzr
GPU: RX 6750 XT (with HSA_OVERRIDE_GFX_VERSION=10.3.0)
Host: Ubuntu 20.04
ROCm: 5.2.0

Test log: https://www.dropbox.com/s/k86vbmjo9hezo8q/LastTest-1.log?dl=0
(actually, it wasn't missing executables)

@ex-rzr
Contributor

ex-rzr commented Feb 5, 2023

Ok, this is a known issue with rocFFT (a specific problem size): ROCm/hipFFT#26. It's fixed in recent releases of ROCm. But anyway, VkFFT is used by default, so it shouldn't be a problem.

@muziqaz

muziqaz commented Mar 27, 2023

https://docs.google.com/spreadsheets/d/1_FMU8mKlWb4LEp3mOwsHwiwgSMKQWn2WxNxdp7QZsfg/edit?usp=sharing
OpenMM 8.0.0 performance comparison between OpenCL and HIP on Linux with AMD GPUs. The RX 550 is no longer supported by either OpenCL or HIP on Linux.
Depending on time availability, I will try compiling newer OpenMM versions with newer HIP versions (if available) to see if anything has been done to sort out 7900 XTX performance (on the HIP side). Also, a Windows HIP SDK seems to be available; I will have to check that out. These tests were run from a standard conda env setup.
As you can see, the performance increase is beyond unbelievable.

@muziqaz

muziqaz commented Apr 20, 2023

One more update for you guys:
We are testing a 1.2M-atom project on F@H.
I ran it through HIP on a 6900 XT on Linux with OpenMM 8.0:
OpenCL: 8.2 ns/day
HIP: 22.3 ns/day
For comparison, a 4070 Ti/3080 Ti on CUDA: 17 ns/day
