Final HIP Platform implementation for AMD GPUs on ROCm #3338
Conversation
Thanks for your work on this! Let's discuss some options for moving forward. We had a lot of discussion on the first PR. See especially my comments at #2736 (comment). Merging this into the main OpenMM code base isn't going to happen in the foreseeable future. It's a huge amount of new code (over 43,000 lines added) which I would be accepting responsibility for maintaining, and I simply don't have the bandwidth to do it. Given the limited number of our users with AMD GPUs, and the fact that we already support them with OpenCL, I just can't justify adding it. If this is something you're willing to maintain long term, a much better option would be to keep it in its own repository as a separate plugin. If we can make it conda installable, it will be very easy for users to get, and it will automatically be detected and used at runtime. We'll be happy to work with you in getting all of that set up.
Out of curiosity, what is the speed difference of HIP against the OpenCL implementation, and against an RTX 3090?
Does the forthcoming AMD MI250 change the equation any?
@peastman Thanks for your reply, perhaps we could get started with making this conda installable. What are the next steps for this?
Great! Let's do that. The first thing you'll want to do is bring this up to date with the current OpenMM code. It looks like it branched off a while ago, and GitHub reports conflicts that would prevent merging it. So merge the latest changes from the master branch, fix conflicts, and get it working again. The next step is to move it into its own repository and allow it to be built independently. Mostly that should be straightforward. It's already an independent plugin, so you probably won't need to make any code changes. But the CMake scripts will require some reworking to separate it out. You might want to look at the example plugin, which can serve as an example for doing that. Once that is done, the final step is to create a feedstock on conda-forge for packaging it. We can help with that once you get to that point.
@peastman I've resolved the merge conflict, so we're ready for the next step. By the way, would it be possible to host this code as a branch on the official OpenMM repository instead of a separate repository? We'd like to have the code be accessible to all OpenMM users without having it in another repository.
Since you'll be maintaining it, it would be best to have it in a repository you own. That repository won't include all of OpenMM. It will include only the code for the new plugin. This is similar to how we handle other plugins that are distributed separately, like https://github.com/openmm/openmm-torch and https://github.com/openmm/openmm-plumed/. Those repositories can also serve as useful examples.
@peastman One last question on the topic: would it be possible to have a repository in the OpenMM GitHub group (e.g.
I'd need to discuss it with all the PIs. Putting it in the openmm organization could imply it's code we maintain and support. None of this will matter to users, of course. They'll just type
@AJcodes: Huge thanks for all the effort in getting this fully working! A few questions that will help us out when the OpenMM team meets soon:
If it would be easier to discuss some of these topics in a call, just reach out by email and I can help set something up.
I would also like to thank @AJcodes for the incredible work done on this. On top of that, I would like to add a few arguments for adding HIP support to OpenMM.
No updates in almost 3 months... is this initiative dead?
No, it's not dead. You can check it here: https://github.com/StreamHPC/openmm-hip We discussed this with the OpenMM team: the new approach is to split the HIP-related changes in two parts: the HIP backend as a plugin, and some changes in OpenMM (required for HIP, in common kernels etc.). I'll merge the recent changes to https://github.com/StreamHPC/openmm and resolve conflicts, then we'll update this PR (or perhaps create a new one).
(Force-pushed from 331654d to 21f05f9.)
Thanks! I'll start going through it and making comments. It may take me a little while; there's a lot to look through! Could you comment on your changes to the benchmarking script? Some of them are obvious changes to support HIP, but others seemed to be adding new features, and it wasn't clear what the reason for them was.
@@ -57,6 +57,20 @@ struct mm_int4 {
     mm_int4(int x, int y, int z, int w) : x(x), y(y), z(z), w(w) {
     }
 };
+struct mm_long2 {
+    long x, y;
You want this to be long long. See https://en.cppreference.com/w/cpp/language/types. In many compilers, long is only 32 bits.
One note on naming. We use the word "warp" to refer to a block of 32 threads, even when that doesn't match the SIMD width of the processor. It's kind of an abuse of notation, but we use it pretty widely. For example, the
 for (int group = GROUP_ID; group < numParticleGroups; group += NUM_GROUPS) {
     // The threads in this block work together to compute the center one group.
     int firstIndex = groupOffsets[group];
     int lastIndex = groupOffsets[group+1];
     real3 center = make_real3(0);
-    for (int index = LOCAL_ID; index < lastIndex-firstIndex; index += LOCAL_SIZE) {
+    for (int index = LOCAL_ID; index < lastIndex-firstIndex; index += BLOCK_SIZE) {
How is BLOCK_SIZE different from LOCAL_SIZE? In other kernels where we need to define a macro for the thread block size, we call it THREAD_BLOCK_SIZE to be clear.
replacements["BLOCK_SIZE"] = cc.intToString(this->blockSize);
replacements["WARP_SIZE"] = cc.intToString(cc.getSIMDWidth());
Following on my earlier comments, "warp size" is always defined as 32, regardless of the hardware. Be aware that getSIMDWidth() is not entirely reliable, since there's no standard mechanism for determining it in OpenCL. If we aren't sure what the width is, that method returns 1.
Has it been decided to integrate HIP support into the Folding at Home GPU core too?
void CommonCalcCustomNonbondedForceKernel::initInteractionGroups(const CustomNonbondedForce& force, const string& interactionSource, const vector<string>& tableTypes) {
    // Process groups to form tiles.
    const int tileSize = cc.getTileSize();
The changes in this class are based on the assumption that the tile size used for computing interaction groups is the same as the size used for standard nonbonded interactions. They both happen to be 32, but there's no reason they need to be the same. They're computed with different code using different neighbor lists. I think it's best to leave it as it is, with the size just hardcoded to 32. Some day we might (or might not) decide it would be useful to make that configurable, but either way it shouldn't be required to match the main nonbonded kernel.
@@ -3011,7 +3048,8 @@ void CommonCalcCustomGBForceKernel::initialize(const System& system, const Custo
     pairValueDefines["NUM_ATOMS"] = cc.intToString(cc.getNumAtoms());
     pairValueDefines["PADDED_NUM_ATOMS"] = cc.intToString(cc.getPaddedNumAtoms());
     pairValueDefines["NUM_BLOCKS"] = cc.intToString(numAtomBlocks);
-    pairValueDefines["TILE_SIZE"] = "32";
+    pairValueDefines["TILE_SIZE"] = cc.intToString(tileSize);
+    pairValueDefines["tileflags"] = (tileSize > 32 ? "unsigned long" : "unsigned int");
That should be mm_ulong rather than unsigned long. long is 32 bits in CUDA, 64 bits in OpenCL.
And likewise in similar code below.
@@ -4390,17 +4430,19 @@ CommonCalcCustomManyParticleForceKernel::~CommonCalcCustomManyParticleForceKerne

 void CommonCalcCustomManyParticleForceKernel::initialize(const System& system, const CustomManyParticleForce& force) {
     ContextSelector selector(cc);
+    const int tileSize = cc.getTileSize();
This is another class that's computed with completely different code from standard nonbonded interactions. It doesn't need to use the same tile size they do.
 // Create data structures used for the neighbor list.
 int numAtomBlocks = (numRealParticles+31)/32;
+const int tileSize = cc.getTileSize();
Another one that shouldn't depend on the tile size used by the main nonbonded kernel.
 // First loop: process tiles that contain exclusions.
+#if !defined(USE_HIP)
Can you explain what this change is about? I don't understand it.
platforms/common/src/kernels/pme.cc
@@ -12,7 +12,12 @@ KERNEL void findAtomGridIndex(GLOBAL const real4* RESTRICT posq, GLOBAL int2* RE
 ) {
     // Compute the index of the grid point each atom is associated with.
+#if !defined(USE_FLAT_KERNELS)
Can you explain what this is about?
// Easier to cope with varying block / wavefront sizes w/o perf. penalty if
// expressed as a constexpr reduction
Keep in mind that CM motion removal takes truly negligible time. It's not worth adding any complexity to optimize it, because there will be no practical benefit.
This allows building and installing the common platform files even if the CUDA and OpenCL platforms are not built. This is required for the HIP platform (openmm-hip) when ROCm OpenCL packages are not installed.
OPENMM_PYTHON_USER_INSTALL is OFF by default.
The HIP platform supports multiple FFT backends, so this commit moves findLegalFFTDimension to ComputeContext so that platforms can provide their own implementations.
@peastman, could you check the current version? There are still some changes that are not mandatory:
We think it's ok to remove these changes if you don't like them, just say the word. I can also split them into separate PRs (though they are very small). Thanks!
Thanks! I'll take a look. Could you also remove the changes to benchmark.py for the moment, just because we have another open PR that rewrites large parts of that script? Once that's merged we can make any other changes to it in a separate PR.
The generated code is not optimal; for example, the compiler generates flat_load instructions instead of ds_read.
Force the compiler to use all registers for gridSpreadCharge and gridInterpolateForce by limiting max waves per EU to 1 on CDNA GPUs; RDNA GPUs work better without it.
This looks great. If we can avoid duplicating all the AtomData struct definitions, that would be preferable. Otherwise, I think it's ready to merge.
#if defined(USE_HIP)
typedef struct alignas(16) {
Since the structures are identical other than alignment, what about writing it like this?
#if defined(USE_HIP)
#define ALIGN alignas(16)
#else
#define ALIGN
#endif
typedef struct ALIGN {
That way we don't repeat the structure definition, which risks someone changing it in one place but not the other.
You are right! Done.
Regarding other structs: I can use the same approach with defines to hide the differences (alignment, padding), but they also have a different order of fields. I'm wary of rearranging fields for all platforms because I'm not sure it won't affect performance. I can profile GBSA and Amoeba benchmarks on a few NVIDIA GPUs with CUDA (3090, Titan V, 2060), but that definitely can't cover all devices where OpenMM is used.
In my opinion, it seems safer to keep separate definitions. Also, it's likely that we'll be able to remove at least some of this commit in the future because (as far as I know) the developers of the HIP compiler continue to work on improving code generation for shared memory accesses.
What do you think?
So far as I know, the order of fields in a shared memory struct should never affect performance on an NVIDIA GPU. It's not like global memory, where the order could affect cache performance. If you test a few GPUs and confirm it doesn't change performance, that's good enough for me.
Manually rearrange fields, add padding, and force alignment to get faster accesses to shared memory: ds_read and ds_write may be slower if addresses are not 16-byte aligned.
Ok, I've combined the structure definitions and profiled gbsa (computeBornSum, computeGBSAForce1), amoebagk, and amoebapme (computeGKForces, computeInducedField, computeEDiffForce, computeElectrostatics) on those 3 devices: I don't see any performance drops caused by field reordering.
That looks good. Which means it's ready to merge!
In which version of OpenMM is HIP expected to be usable? Is 8.0 already using it, or only OpenCL/CUDA?
This PR contains the changes in common files we needed for HIP support. There are instructions on how to build it from source or install a package from conda (I've noticed that OpenMM 8.0.0 has been released; I'll need to rebuild the conda package for this version). If you have any questions or problems, please open an issue in that repository (and ping me, just to be sure).
We are testing 8.0.0rc1 with openmm-hip built from sources downloaded from https://github.com/StreamHPC/openmm-hip. All tests are passing without changes.
All openmm-8.0.0rc1 + openmm-hip tests passed except
Thanks, @egallicc. You're using RDNA GPUs, correct? What ROCm version and OS? (I'm just gathering statistics, as it's impossible to test everywhere.)
That's interesting. These tests had failures on older ROCm versions, but they shouldn't fail because of missing executables. Can you post the log?
Yes, @ex-rzr. Test log: https://www.dropbox.com/s/k86vbmjo9hezo8q/LastTest-1.log?dl=0
Ok, this is a known issue with rocFFT (a specific problem size): ROCm/hipFFT#26. It's fixed in recent releases of ROCm. But anyway, VkFFT is used by default, so it shouldn't be a problem.
https://docs.google.com/spreadsheets/d/1_FMU8mKlWb4LEp3mOwsHwiwgSMKQWn2WxNxdp7QZsfg/edit?usp=sharing
One more update for you guys: |
Hi Dr. Eastman and co.,
This PR continues the work that was done here, where we made sure that the current code is up to date with the latest release of the OpenMM master branch, and supports all the same functionality as the current CUDA platform (this includes all of the CUDA-specific plugins too, such as amoeba).
Additionally, further optimisations have been made in this PR to ensure the best possible performance on AMD GPUs. Such optimisations include load-balancing, the use of shuffle operations, etc. that are specific to the HIP platform.
As for testing, we focused on two sets of tests: the unit tests within the repository and the tests from openmmtools. On our side, the unit tests within the repository all pass. I've attached a log from the openmmtools tests, which we ran on an MI100 (64 tests pass and 4 fail, but this is also what we got when running the tests on an RTX 3090): 2021-11-10-openmmtools-0-mi100-fixed-amoeba.log
Given that the scope of this PR is extensive, I'd like to discuss the next steps for it (e.g. how we can help to ease the reviewing process).