
Final HIP Platform implementation for AMD GPUs on ROCm #3338

Merged
merged 9 commits into openmm:master on Jul 22, 2022

Conversation

@AJcodes
Contributor

AJcodes commented Nov 18, 2021

Hi Dr. Eastman and co.,

This MR continues the work that was done here: we made sure that the code is up to date with the latest release of the OpenMM master branch and supports all the same functionality as the current CUDA platform (including all of the CUDA-specific plugins, such as amoeba).

Additionally, further optimisations have been made in this MR to ensure the best possible performance on AMD GPUs. These include load balancing, the use of shuffle operations, and other changes specific to the HIP platform.

As for testing, we focused on two sets of tests: the unit tests within the repository and the tests from openmmtools. On our side, the unit tests within the repository all pass. I've attached a log from the openmmtools tests, which we ran on an MI100 (64 tests pass and 4 fail, which is the same result we got when running the tests on an RTX 3090).

2021-11-10-openmmtools-0-mi100-fixed-amoeba.log

Given that the scope of this PR is extensive, I'd like to discuss the next steps (e.g. how we can help to ease the reviewing process).

@peastman
Member

Thanks for your work on this! Let's discuss some options for moving forward.

We had a lot of discussion on the first PR. See especially my comments at #2736 (comment). Merging this into the main OpenMM code base isn't going to happen in the foreseeable future. It's a huge amount of new code (over 43,000 lines added) which I would be accepting responsibility for maintaining, and I simply don't have the bandwidth to do it. Given the limited number of our users with AMD GPUs, and the fact that we already support them with OpenCL, I just can't justify adding it.

If this is something you're willing to maintain long term, a much better option would be to keep it in its own repository as a separate plugin. If we can make it conda installable, it will be very easy for users to get, and it will automatically be detected and used at runtime. We'll be happy to work with you in getting all of that set up.

@giadefa
Member

giadefa commented Nov 18, 2021 via email

@jchodera
Member

Does the forthcoming AMD MI250 change the equation any?

@AJcodes
Contributor Author

AJcodes commented Nov 24, 2021

If this is something you're willing to maintain long term, a much better option would be to keep it in its own repository as a separate plugin. If we can make it conda installable, it will be very easy for users to get, and it will automatically be detected and used at runtime. We'll be happy to work with you in getting all of that set up.

@peastman Thanks for your reply, perhaps we could get started with making this conda installable. What are the next steps for this?

@AJcodes
Contributor Author

AJcodes commented Nov 24, 2021

Does the forthcoming AMD MI250 change the equation any?

@jchodera The optimisations in this MR do take the MI250 into account.

@peastman
Member

Great! Let's do that.

The first thing you'll want to do is bring this up to date with the current OpenMM code. It looks like it branched off a while ago, and github reports conflicts that would prevent merging it. So merge the latest changes from the master branch, fix conflicts, and get it working again.

The next step is to move it into its own repository and allow it to be built independently. Mostly that should be straightforward. It's already an independent plugin, so you probably won't need to make any code changes. But the CMake scripts will require some reworking to separate it out. You might want to look at the example plugin, which can serve as an example for doing that.

Once that is done, the final step is to create a feedstock on conda-forge for packaging it. We can help with that once you get to that point.

@AJcodes
Contributor Author

AJcodes commented Dec 2, 2021

Great! Let's do that.

The first thing you'll want to do is bring this up to date with the current OpenMM code. It looks like it branched off a while ago, and github reports conflicts that would prevent merging it. So merge the latest changes from the master branch, fix conflicts, and get it working again.

The next step is to move it into its own repository and allow it to be built independently. Mostly that should be straightforward. It's already an independent plugin, so you probably won't need to make any code changes. But the CMake scripts will require some reworking to separate it out. You might want to look at the example plugin, which can serve as an example for doing that.

Once that is done, the final step is to create a feedstock on conda-forge for packaging it. We can help with that once you get to that point.

@peastman I've resolved the merge conflict, so we're ready for the next step. By the way, would it be possible to host this code as a branch on the official OpenMM repository instead of a separate repository? We'd like to have the code be accessible to all OpenMM users without having it in another repository.

@peastman
Member

peastman commented Dec 2, 2021

Since you'll be maintaining it, it would be best to have it in a repository you own. That repository won't include all of OpenMM. It will include only the code for the new plugin. This is similar to how we handle other plugins that are distributed separately, like https://github.com/openmm/openmm-torch and https://github.com/openmm/openmm-plumed/. Those repositories can also serve as useful examples.

@AJcodes
Contributor Author

AJcodes commented Dec 3, 2021

@peastman One last question on the topic: would it be possible to have a repository in the OpenMM GitHub organization (e.g. openmm/openmm-hip)?

@peastman
Member

peastman commented Dec 3, 2021

I'd need to discuss it with all the PIs. Putting it in the openmm organization could imply it's code we maintain and support.

None of this will matter to users, of course. They'll just type

conda install -c conda-forge openmm-hip

@jchodera
Member

jchodera commented Dec 3, 2021

@AJcodes : Huge thanks for all the effort in getting this fully working!

A few questions that will help us out when the OpenMM team meets soon:

  • Have you folks run the full OpenMM benchmark suite on any hardware? Especially interesting would be a comparison of OpenCL and HIP.
  • If we did bring this into the OpenMM GitHub org as a separate repo for the plugin, what kind of commitment to help maintain the code could be provided? If the commitment was limited, could we have a warning that this is experimental?
  • Is there any interest in having the HIP platform also available on the OpenMM core deployed across Folding@home, or is the primary interest in achieving top performance on high-end datacenter-grade GPUs?

If it would be easier to discuss some of these topics in a call, just reach out by email and I can help set something up.

@muziqaz

muziqaz commented Dec 11, 2021

I also would like to thank @AJcodes for the incredible work done on this. On top of that, I would like to add a few arguments for adding HIP support to OpenMM.

While OpenCL support is there for AMD GPUs, the performance is really poor, to put it politely. Moreover, AMD seem to have started to distance themselves from OpenCL support in the past year. We are seeing up to 50% performance degradation in Folding@home (and other distributed computing projects) going from driver version 21.3.2 (Windows) to anything after that. Even with the 21.3.2 drivers AMD is far behind NVIDIA's CUDA, so add another 50% drop in performance and we have no choice but to exclude all AMD hardware from future F@H projects. AMD seem to be completely oblivious to our cries about it, so I figure they are now fully concentrating on HIP instead of OpenCL.

Having HIP support in OpenMM would save the existing base of AMD users at F@H (and other distributed computing projects), and it would not put off new users, of which there are plenty, since AMD is very competitive across the board nowadays. We are seeing incredible inroads by AMD in various GPU markets, which hasn't happened for a very long time, if ever. It would be beneficial to the OpenMM project to have HIP support; otherwise we end up looking stagnant with regard to AMD hardware by settling for OpenCL.

I know AMD themselves are a big part of the issue here.

If there is a need for AMD hardware to run tests on, it should be possible to make arrangements for remote access to a Vega 64, Radeon VII, 5700 XT and 6900 XT (Windows only, though).

Regards

@xCaradhras

No updates in almost 3 months... is this initiative dead?

@ex-rzr
Contributor

ex-rzr commented Feb 7, 2022

No updates in almost 3 months... is this initiative dead?

No, it's not dead. You can check it here: https://github.com/StreamHPC/openmm-hip

We discussed this with the OpenMM team: the new approach is to split the HIP-related changes into two parts: the HIP backend as a plugin, and some changes in OpenMM that are required for HIP (in common kernels, etc.).

I'll merge the recent changes to https://github.com/StreamHPC/openmm and resolve conflicts, then we'll update this PR (or perhaps create a new one).

@AJcodes
Contributor Author

AJcodes commented Feb 8, 2022

@peastman @jchodera As discussed, @ex-rzr has completed the split. This PR now contains the changes in OpenMM (in the common kernels, etc.) that are required for HIP, and is ready for review.

@peastman
Member

peastman commented Feb 8, 2022

Thanks! I'll start going through it and making comments. It may take me a little while--there's a lot to look through!

Could you comment on your changes to the benchmarking script? Some of them are obvious changes to support HIP, but others seemed to be adding new features, and it wasn't clear what the reason for them was.

@@ -57,6 +57,20 @@ struct mm_int4 {
    mm_int4(int x, int y, int z, int w) : x(x), y(y), z(z), w(w) {
    }
};
struct mm_long2 {
    long x, y;
Member

You want this to be long long. See https://en.cppreference.com/w/cpp/language/types. In many compilers, long is only 32 bits.
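As a minimal, self-contained sketch of the suggested fix (the constructors here are assumed by analogy with the mm_int4 struct shown above):

struct mm_long2 {
    // long long is guaranteed to be at least 64 bits, whereas long can be
    // only 32 bits with some compilers (e.g. LLP64 on 64-bit Windows).
    long long x, y;
    mm_long2() {
    }
    mm_long2(long long x, long long y) : x(x), y(y) {
    }
};
static_assert(sizeof(long long) >= 8, "long long is at least 64 bits");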

@peastman
Member

peastman commented Feb 8, 2022

One note on naming. We use the word "warp" to refer to a block of 32 threads, even when that doesn't match the SIMD width of the processor. It's kind of an abuse of notation, but we use it pretty widely. For example, the SYNC_WARPS macro ensures that groups of 32 threads are synchronized. On Intel GPUs where the SIMD width is only 16, that means synchronizing the entire block. It's a bit more than strictly required, and possibly hurts performance a bit. But not very much, and anyway, we aren't too concerned about getting the best possible performance on very low end GPUs.
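As a purely hypothetical illustration of that convention (these are not OpenMM's actual macro definitions), SYNC_WARPS could be defined per backend roughly like this:

// Hypothetical sketch: SYNC_WARPS guarantees that each group of 32 threads
// (a "warp" in OpenMM's usage) is synchronized, whatever the hardware width.
#ifdef __CUDACC__
    // On NVIDIA hardware a warp really is 32 threads, so a warp-level
    // synchronization is sufficient.
    #define SYNC_WARPS __syncwarp();
#else
    // On hardware whose SIMD width is below 32 (e.g. 16 on some Intel GPUs),
    // synchronizing the whole work-group is a simple, slightly conservative
    // way to keep every 32-thread group in step.
    #define SYNC_WARPS barrier(CLK_LOCAL_MEM_FENCE);
#endif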

for (int group = GROUP_ID; group < numParticleGroups; group += NUM_GROUPS) {
// The threads in this block work together to compute the center one group.

int firstIndex = groupOffsets[group];
int lastIndex = groupOffsets[group+1];
real3 center = make_real3(0);
for (int index = LOCAL_ID; index < lastIndex-firstIndex; index += LOCAL_SIZE) {
for (int index = LOCAL_ID; index < lastIndex-firstIndex; index += BLOCK_SIZE) {
Member

How is BLOCK_SIZE different from LOCAL_SIZE?

In other kernels where we need to define a macro for the thread block size, we call it THREAD_BLOCK_SIZE to be clear.

Comment on lines 1668 to 1669
replacements["BLOCK_SIZE"] = cc.intToString(this->blockSize);
replacements["WARP_SIZE"] = cc.intToString(cc.getSIMDWidth());
Member

Following on my earlier comments, "warp size" is always defined as 32, regardless of the hardware. Be aware that getSIMDWidth() is not entirely reliable, since there's no standard mechanism for determining it in OpenCL. If we aren't sure what the width is, that method returns 1.
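For illustration only, a hedged sketch of host code following that convention (useShuffle is an invented name for the example; replacements and cc are as in the diff above):

// Always define the "warp" size as 32 by convention; treat the reported SIMD
// width only as an optimization hint, since getSIMDWidth() returns 1 when the
// width cannot be determined.
replacements["WARP_SIZE"] = cc.intToString(32);
bool useShuffle = (cc.getSIMDWidth() >= 32);  // enable width-dependent paths only when the width is known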

@gunnarre

gunnarre commented Feb 8, 2022

We discussed this with the OpenMM team: the new approach is to split the HIP-related changes into two parts: the HIP backend as a plugin, and some changes in OpenMM that are required for HIP (in common kernels, etc.).

Has it been decided to integrate HIP support into the Folding at Home GPU core too?

void CommonCalcCustomNonbondedForceKernel::initInteractionGroups(const CustomNonbondedForce& force, const string& interactionSource, const vector<string>& tableTypes) {
// Process groups to form tiles.


const int tileSize = cc.getTileSize();
Member

The changes in this class are based on the assumption that the tile size used for computing interaction groups is the same as the size used for standard nonbonded interactions. They both happen to be 32, but there's no reason they need to be the same. They're computed with different code using different neighbor lists. I think it's best to leave it as it is, with the size just hardcoded to 32. Some day we might (or might not) decide it would be useful to make that configurable, but either way it shouldn't be required to match the main nonbonded kernel.

@@ -3011,7 +3048,8 @@ void CommonCalcCustomGBForceKernel::initialize(const System& system, const Custo
pairValueDefines["NUM_ATOMS"] = cc.intToString(cc.getNumAtoms());
pairValueDefines["PADDED_NUM_ATOMS"] = cc.intToString(cc.getPaddedNumAtoms());
pairValueDefines["NUM_BLOCKS"] = cc.intToString(numAtomBlocks);
pairValueDefines["TILE_SIZE"] = "32";
pairValueDefines["TILE_SIZE"] = cc.intToString(tileSize);
pairValueDefines["tileflags"] = (tileSize > 32 ? "unsigned long" : "unsigned int");
Member

That should be mm_ulong rather than unsigned long. long is 32 bits in CUDA, 64 bits in OpenCL.

Member

And likewise in similar code below.
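For concreteness, a sketch of what that substitution would look like on the line quoted above (illustrative, not the final committed code):

// Emit OpenMM's mm_ulong type, which is 64 bits on every backend, instead of
// "unsigned long", whose width differs between CUDA and OpenCL.
pairValueDefines["tileflags"] = (tileSize > 32 ? "mm_ulong" : "unsigned int");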

@@ -4390,17 +4430,19 @@ CommonCalcCustomManyParticleForceKernel::~CommonCalcCustomManyParticleForceKerne

void CommonCalcCustomManyParticleForceKernel::initialize(const System& system, const CustomManyParticleForce& force) {
ContextSelector selector(cc);
const int tileSize = cc.getTileSize();
Member

This is another class that's computed with completely different code from standard nonbonded interactions. It doesn't need to use the same tile size they do.

// Create data structures used for the neighbor list.

int numAtomBlocks = (numRealParticles+31)/32;
const int tileSize = cc.getTileSize();
Member

Another one that shouldn't depend on the tile size used by the main nonbonded kernel.


// First loop: process tiles that contain exclusions.
#if !defined(USE_HIP)
Member

Can you explain what this change is about? I don't understand it.

@@ -12,7 +12,12 @@ KERNEL void findAtomGridIndex(GLOBAL const real4* RESTRICT posq, GLOBAL int2* RE
) {
// Compute the index of the grid point each atom is associated with.

#if !defined(USE_FLAT_KERNELS)
Member

Can you explain what this is about?

Comment on lines 23 to 24
// Easier to cope with varying block / wavefront sizes w/o perf. penalty if
// expressed as a constexpr reduction
Member

Keep in mind that CM motion removal takes truly negligible time. It's not worth adding any complexity to optimize it, because there will be no practical benefit.

It allows building and installing the common platform files even if
the CUDA or OpenCL platforms are not built.
This is required for the HIP platform (openmm-hip) if the ROCm OpenCL
packages are not installed.
OPENMM_PYTHON_USER_INSTALL is OFF by default.
The HIP platform supports multiple FFT backends; this commit moves
findLegalFFTDimension to ComputeContext, so platforms can have their own
implementations.
@ex-rzr
Contributor

ex-rzr commented Jul 18, 2022

@peastman, could you check the current version?
I removed everything related to tile sizes and some other less important code.

There are still some changes that are not mandatory:

  • 70dab08 - a more convenient way to install python packages for the current user only (we use it when multiple developers need to work on OpenMM without conflicts)
  • f34057e - the comment says what the change does. I added it after spending an hour trying to understand why a changed kernel worked incorrectly. It turned out that during development I had copied some code containing multi-line replacement tokens like SHUFFLE_WARP_DATA and commented one copy out with //, which in the final generated code left only the first line commented. This check catches such situations (see the sketch below).
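To illustrate the pitfall described in f34057e (the expansion shown here is invented for the example): the kernels are assembled by plain text substitution, so a multi-line token hidden behind a // comment is only partly disabled.

// Source as written, apparently disabled:
//     // SHUFFLE_WARP_DATA
//
// After the host code substitutes the token with its multi-line expansion,
// only the first expanded line remains commented out:
//     // tempForce = SHFL(tempForce, srcLane);   <- commented
//     tempPos = SHFL(tempPos, srcLane);          // <- still compiled and executed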

We think it's OK to remove these changes if you don't like them; just say the word. I can also split them into separate PRs (though they are very small).

Thanks!

@peastman
Member

Thanks! I'll take a look. Could you also remove the changes to benchmark.py for the moment, just because we have another open PR that rewrites large parts of that script? Once that's merged we can make any other changes to it in a separate PR.

skyreflectedinmirrors and others added 3 commits July 19, 2022 10:37
The generated code is not optimal; for example, the compiler generates
flat_load instructions instead of ds_read.
Force the compiler to use all registers for gridSpreadCharge and
gridInterpolateForce by limiting max waves per EU to 1 on CDNA GPUs;
RDNA GPUs work better without it.
@peastman
Member

peastman left a comment

This looks great. If we can avoid duplicating all the AtomData struct definitions, that would be preferable. Otherwise, I think it's ready to merge.

Comment on lines 3 to 5
#if defined(USE_HIP)

typedef struct alignas(16) {
Member

Since the structures are identical other than alignment, what about writing it like this?

#if defined(USE_HIP)
    #define ALIGN alignas(16)
#else
    #define ALIGN
#endif

typedef struct ALIGN {

That way we don't repeat the structure definition, which risks someone changing it in one place but not the other.

Contributor

You are right! Done.

Regarding the other structs: I can use the same approach with defines to hide the differences (alignment, padding), but they also have a different order of fields. I'm worried about rearranging fields for all platforms because I'm not sure it won't affect performance. I can profile the GBSA and Amoeba benchmarks on a few Nvidia GPUs with CUDA (3090, Titan V, 2060), but that definitely can't cover all devices where OpenMM is used.
In my opinion, it seems safer to keep separate definitions. Also, it's likely that we'll be able to remove at least some of this commit in the future because (as far as I know) the developers of the HIP compiler continue to work on improving code generation for shared memory accesses.

What do you think?

@peastman
Member

So far as I know, the order of fields in a shared memory struct should never affect performance on an NVIDIA GPU. It's not like global memory, where the order could affect cache performance. If you test a few GPUs and confirm it doesn't change performance, that's good enough for me.

Manually rearrange fields, add padding and force alignment to
get faster accesses to shared memory: ds_read and ds_write may
be slower if addresses are not 16-byte aligned.
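As a rough illustration of the kind of layout change described here (the struct fields are invented for the example; real is OpenMM's float/double typedef):

// Hypothetical sketch: alignas(16) plus explicit padding keeps each element
// stored in shared (LDS) memory on a 16-byte boundary, so the compiler can
// use full-width ds_read/ds_write instructions.
typedef struct alignas(16) {
    real x, y, z, q;    // position and charge
    real fx, fy, fz;    // accumulated force
    real padding;       // explicit padding keeps sizeof a multiple of 16 bytes
} AtomData;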
@ex-rzr
Contributor

ex-rzr commented Jul 22, 2022

Ok, I've combined the structure definitions and profiled gbsa (computeBornSum, computeGBSAForce1), amoebagk and amoebapme (computeGKForces, computeInducedField, computeEDiffForce, computeElectrostatics) on those 3 devices: I don't see any performance drops caused by the field reordering.

@peastman
Member

That looks good. Which means it's ready to merge!

peastman merged commit a39fa14 into openmm:master on Jul 22, 2022
@DanielWicz

DanielWicz commented Jan 31, 2023

In which version of OpenMM is HIP expected to be usable? Is 8.0 already using it, or only OpenCL/CUDA?

@ex-rzr
Contributor

ex-rzr commented Jan 31, 2023

@DanielWicz

In which version of OpenMM is HIP expected to be usable?

This PR contains the changes in common files that we needed for HIP support.
The HIP platform is a plugin, which you can find here: https://github.com/amd/openmm-hip

There are instructions on how to build it from source or install a package from conda (I've noticed that OpenMM 8.0.0 has been released; I'll need to rebuild the conda package for this version).

If you have any questions or problems, please open an issue in that repository (and ping me, just to be sure).

@egallicc
Contributor

We are testing 8.0.0rc1 with openmm-hip built from the sources downloaded from https://github.com/StreamHPC/openmm-hip. All tests are passing without changes.

@egallicc
Contributor

egallicc commented Feb 1, 2023

All openmm-8.0.0rc1 + openmm-hip tests passed except TestHipFFTImplHipFFT{Single,Mixed,Double} because the executable could not be found.

@ex-rzr
Contributor

ex-rzr commented Feb 1, 2023

Thanks, @egallicc

You're using RDNA GPUs, correct? What ROCm version, OS? (I'm just gathering statistics, as it's impossible to test everywhere)

TestHipFFTImplHipFFT{Single,Mixed,Double} because the executable could not be found

That's interesting. These tests had failures on older ROCm versions, but they shouldn't fail because of missing executables. Can you post the log?

@egallicc
Contributor

egallicc commented Feb 5, 2023

Yes, @ex-rzr
GPU: RX 6750 XT (with HSA_OVERRIDE_GFX_VERSION=10.3.0)
Host: Ubuntu 20.04
ROCm: 5.2.0

Test log: https://www.dropbox.com/s/k86vbmjo9hezo8q/LastTest-1.log?dl=0
(actually, it wasn't missing executables)

@ex-rzr
Contributor

ex-rzr commented Feb 5, 2023

Ok, this is a known issue with rocFFT (a specific problem size): ROCm/hipFFT#26. It's fixed in recent releases of ROCm. But anyway, VkFFT is used by default, so it shouldn't be a problem.

@muziqaz

muziqaz commented Mar 27, 2023

https://docs.google.com/spreadsheets/d/1_FMU8mKlWb4LEp3mOwsHwiwgSMKQWn2WxNxdp7QZsfg/edit?usp=sharing
OpenMM 8.0.0 performance comparison between OpenCL and HIP on Linux with AMD GPUs. The RX 550 is no longer supported by either OpenCL or HIP on Linux.
Depending on time availability, I will try compiling newer OpenMM versions with newer HIP versions (if available) to see if anything has been done to sort out 7900 XTX performance (on the HIP side). Also, a Windows HIP SDK seems to be available; I will have to check that out. These tests were run from a standard conda env setup.
As you can see, the performance increase is beyond unbelievable.

@muziqaz

muziqaz commented Apr 20, 2023

One more update for you guys:
We are testing a 1.2M-atom project on F@H.
I ran it through HIP on a 6900 XT on Linux with OpenMM 8.0:
OpenCL: 8.2 ns/day
HIP: 22.3 ns/day
For comparison, a 4070 Ti/3080 Ti on CUDA: 17 ns/day
