Investigate slow OpenCL performance on AMD #3937

Open
peastman opened this issue Jan 31, 2023 · 105 comments

@peastman
Member

This is to continue the discussion that started in #3934. On AMD GPUs, the OpenCL platform is sometimes several times slower than the HIP platform. We're trying to figure out why. Much of the slowness seems to be due to large gaps between kernels where the GPU is sitting idle.

@peastman
Member Author

integrateLangevinMiddlePart3 -> calcCenterOfMassMomentum

That's a mystery. There shouldn't be anything happening in between. No data transfer, no synchronization. We just launch one kernel, then launch the other.

findBlocksWithInteractions -> computeBornSum

In between those we do an enqueueReadBuffer() to transfer data back to the host. It's specified as non-blocking, and we don't wait on the event until much later, after computeNonbonded has finished. But perhaps it's somehow making it a blocking call instead?

computeNonbonded -> reduceForces

This is where we wait on that event. But that doesn't make sense as the explanation. We already had one long pause, and transferring a few bytes from device to host shouldn't take hundreds of microseconds.
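
For reference, the pattern being described looks roughly like this with the OpenCL C++ bindings (a sketch, not OpenMM's exact source; the buffer and variable names follow the ones discussed in this thread):

    // Non-blocking readback of the interaction count into pinned host memory.
    // The call returns immediately and records an event we can wait on later.
    cl::Event downloadCountEvent;
    queue.enqueueReadBuffer(interactionCount, CL_FALSE /*non-blocking*/, 0,
                            sizeof(unsigned int), pinnedCountMemory,
                            NULL, &downloadCountEvent);
    // ...computeBornSum, computeNonbonded, etc. are enqueued here...
    // Only much later do we block on the event and read the value:
    downloadCountEvent.wait();
    unsigned int interactions = *pinnedCountMemory;

In principle nothing in between should stall; the question is whether the driver is treating the readback as if it were blocking.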

@philipturner
Contributor

There is one way to quickly test the theory: run multiple instances of the simulation in parallel, using multiple CPU cores. I hypothesize that running two simulations will double the throughput in units of simulation*ns/day.

@peastman I suspect this is harming performance on GPUs besides AMD. You or I could test the hypothesis on an M1 GPU.
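
For concreteness, a rough sketch of that experiment with the OpenMM C++ API is below. The system setup is a toy placeholder (argon-like particles with a NonbondedForce), not one of the actual benchmark systems, and the particle and step counts are arbitrary; the point is only that two contexts are driven from two CPU threads and the aggregate throughput is compared against a single run.

    #include "OpenMM.h"
    #include <thread>
    #include <vector>

    // Run one small, self-contained simulation on the OpenCL platform.
    static void runSimulation(int numParticles, int steps) {
        OpenMM::System system;
        OpenMM::NonbondedForce* nb = new OpenMM::NonbondedForce();
        std::vector<OpenMM::Vec3> positions;
        for (int i = 0; i < numParticles; i++) {
            system.addParticle(39.95);           // argon-like mass (amu)
            nb->addParticle(0.0, 0.34, 0.997);   // charge, sigma (nm), epsilon (kJ/mol)
            positions.push_back(OpenMM::Vec3(0.5*i, 0.0, 0.0));
        }
        system.addForce(nb);                     // the System takes ownership of the force
        OpenMM::LangevinMiddleIntegrator integrator(300.0, 1.0, 0.004);
        OpenMM::Platform& platform = OpenMM::Platform::getPlatformByName("OpenCL");
        OpenMM::Context context(system, integrator, platform);
        context.setPositions(positions);
        integrator.step(steps);
    }

    int main() {
        OpenMM::Platform::loadPluginsFromDirectory(OpenMM::Platform::getDefaultPluginsDirectory());
        // Two independent simulations on two CPU threads. If aggregate throughput
        // (simulation*ns/day) roughly doubles, the GPU was idle much of the time
        // in the single-simulation case.
        std::thread a(runSimulation, 1000, 10000);
        std::thread b(runSimulation, 1000, 10000);
        a.join();
        b.join();
        return 0;
    }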

@peastman
Member Author

It's clear that the gaps are hurting performance. The question is, what causes them?

@philipturner
Contributor

How about we narrow down the problem? For example, remove one of the kernels that isn't causing a gap and see whether the gap persists. Repeat the procedure; you get the idea. This will narrow things down and confirm whether, in isolation, computeNonbonded and reduceForces always have a massive gap. You could also start from just those final two kernels, though I hypothesize the gap won't appear in that case.

@bdenhollander
Contributor

I commented out the enqueueReadBuffer and the code that relied on it. Performance is through the roof on gbsa, 410 ns/day -> 3030 ns/day!

void OpenCLNonbondedUtilities::computeInteractions(int forceGroups, bool includeForces, bool includeEnergy) {
...
    //downloadCountEvent.wait();
    //updateNeighborListSize();
}

The gap between computeNonbonded and reduceForces is eliminated. I drew a green line where it would normally be.

[screenshot: profiler timeline, with a green line marking where the computeNonbonded -> reduceForces gap used to be]

@bdenhollander
Contributor

integrateLangevinMiddlePart3 has this little gem:

void CommonIntegrateLangevinMiddleStepKernel::execute(ContextImpl& context, const LangevinMiddleIntegrator& integrator) {
...
    // Reduce UI lag.
#ifdef WIN32
    cc.flushQueue();
#endif
}

Commenting out cc.flushQueue(); eliminates the integrateLangevinMiddlePart3 -> calcCenterOfMassMomentum gap but doesn't have much impact on performance. It is the shortest of the 3 gaps.
[screenshot: profiler timeline without the integrateLangevinMiddlePart3 -> calcCenterOfMassMomentum gap]

@philipturner
Contributor

I commented out the enqueueReadBuffer and the code that relied on it. Performance is through the roof on gbsa, 410 ns/day -> 3030 ns/day!

You commented out the part that rebuilds nearest neighbor lists, so this is misleading.

@peastman
Member Author

I commented out the enqueueReadBuffer and the code that relied on it. Performance is through the roof on gbsa, 410 ns/day -> 3030 ns/day!

Sadly, I don't think that's real. It just means you broke it!

We don't know in advance how much memory we'll need for the neighbor list. We pick an amount that we hope will be enough, but often isn't. Since the GBSA benchmark uses a larger cutoff than the others, it isn't.

That's OK in normal operation. The nonbonded kernel detects that there wasn't enough memory for the neighbor list and returns without doing much. updateNeighborListSize() then allocates more memory and marks the force evaluation as invalid so we know to repeat it (roughly as sketched below). But you removed the call to it, so it just keeps running timesteps with the nonbonded kernel always returning early and not computing most of the interactions.
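
Schematically, that resize-and-repeat logic looks something like the following (paraphrased; the function and variable names are illustrative, not the exact code in OpenCLNonbondedUtilities):

    // Paraphrase of the neighbor-list retry logic described above.
    bool computeForcesOnce() {
        launchNonbondedKernels();            // the kernel bails out early if the
                                             // neighbor list buffers were too small
        downloadCountEvent.wait();           // get the interaction count on the host
        unsigned int needed = *pinnedCountMemory;
        if (needed > neighborListCapacity) {
            resizeNeighborListBuffers(needed);  // allocate more memory
            return false;                       // force evaluation was invalid;
        }                                       // the caller must repeat the step
        return true;
    }

With the wait() and updateNeighborListSize() removed, the "return false" path never happens, so the kernel silently keeps skipping most interactions.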

Commenting out cc.flushQueue(); eliminates the integrateLangevinMiddlePart3 -> calcCenterOfMassMomentum gap but doesn't have much impact on performance.

That only happens on Windows. It was added for Folding@home. Previously OpenMM had a tendency to load the GPU so heavily that the computer became unresponsive, and people couldn't do anything else while it was running. We added the flush to force it to take a break and ensure other processes would have time to run. But only on Windows, since on other platforms it probably isn't running as part of Folding@home.

@philipturner
Contributor

I suggested a better way to examine the problem: run two simulations in parallel. This requires a more complex setup than benchmark.py. It's particularly important to my use case because, even if we can't resolve the latency issue, I can still extract full throughput that way.

Benchmarks on M1 Max with the Metal plugin. Some benchmarks show significant underutilization, even with correct SIMD width. Remember that before, apoa1rf was using 13 W.

Benchmark          Peak Power   Theoretical Limit   Theoretical Speedup
gbsa               13.2 W       40-60 W             3.03-4.54x
rf                 19.6 W       40-60 W             2.04-3.06x
pme                16.6 W       40-60 W             2.41-3.61x
apoa1rf            30.9 W       40-60 W             1.29-1.94x
apoa1pme           23.5 W       40-60 W             1.70-2.55x
apoa1ljpme         23.2 W       40-60 W             1.72-2.59x
amoebagk           15.2 W       40-60 W             2.63-3.95x
amoebapme          17.1 W       40-60 W             2.34-3.51x
amber20-cellulose  27.4 W       40-60 W             1.46-2.19x
amber20-stmv       27.1 W       40-60 W             1.48-2.21x

Benchmark          ns/day   Theoretical Limit (Lower)   Theoretical Limit (Upper)
gbsa               379.4    1149.6                      1722.5
rf                 300.2    612.4                       918.6
pme                198.2    477.7                       715.5
apoa1rf            112.8    145.5                       218.8
apoa1pme           66.4     112.9                       169.3
apoa1ljpme         49.7     85.5                        128.7
amoebagk           1.18     3.10                        4.66
amoebapme          3.93     9.20                        13.79
amber20-cellulose  14.18    20.70                       31.05
amber20-stmv       3.83     5.67                        8.46

@bdenhollander
Contributor

I know I broke the simulation... but in the process I think I confirmed that enqueueReadBuffer and downloadCountEvent.wait() introduce pauses.

@peastman
Member Author

peastman commented Feb 1, 2023

The PME benchmark is a better test for this. It never needs to resize the neighbor list, so we can safely remove the related calls and see what difference it makes. Here's what I get. (I finally got my AMD GPU working again!)

Original: 320 ns/day
Download the interaction count, but don't wait on the event: 340 ns/day
Don't download it: 406 ns/day

The download causes about a 25% slowdown, even though it's supposed to happen asynchronously and nothing ever depends on it having completed.

@philipturner
Contributor

philipturner commented Feb 1, 2023

gbsa, rf, and amoebapme seem messed up. Elsewhere, there was a minor performance boost. Faster simulations (where latency has more impact) received a greater speedup, indicating the gain came from no longer waiting.

Benchmark          Before ns/day   After ns/day     Before Peak W   After Peak W
gbsa               379.4           1145.2           13.2            5.2
rf                 300.2           silently fails   19.6            silently fails
pme                198.2           243.5            16.6            20.9
apoa1rf            112.8           127.7            30.9            36.4
apoa1pme           66.4            71.9             23.5            26.1
apoa1ljpme         49.7            53.2             23.2            25.6
amoebagk           1.181           1.183            15.2            15.6
amoebapme*         3.97            4.59             17.3            13.1
amber20-cellulose  14.18           14.41            27.4            27.8
amber20-stmv*      3.78            3.77             27.2            27.4

*These were (re-)run for 3 trials because the data seemed strange. Everything else was 1 trial.

Benchmark          ns/day Change   Power Change   (Δ ns/day)/(Δ W)
gbsa               3.02x           0.39x          7.66
rf                 n/a             n/a            n/a
pme                1.23x           1.26x          0.98
apoa1rf            1.13x           1.18x          0.96
apoa1pme           1.08x           1.11x          0.97
apoa1ljpme         1.07x           1.10x          0.97
amoebagk           1.00x           1.03x          0.98
amoebapme          1.16x           0.76x          1.53
amber20-cellulose  1.02x           1.01x          1.00
amber20-stmv       1.00x           1.01x          0.99

@bdenhollander
Contributor

Original: 320 ns/day
Download the interaction count, but don't wait on the event: 340 ns/day
Don't download it: 406 ns/day

My results are similar:

Original: 300 ns/day
Download the interaction count, but don't wait on the event: 320 ns/day
Don't download it: 444 ns/day

The PME benchmark is a better test for this. It never needs to resize the neighbor list, so we can safely remove the related calls and see what difference it makes.

Can this be known ahead of time in more situations? If so, wrapping the enqueueReadBuffer in an if statement would be a good first step.
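
Something like this guard, presumably (purely illustrative: "neighborListCanGrow" is a hypothetical flag standing in for "we might need to resize this step", and as noted below such a flag isn't actually available):

    // Hypothetical guard around the readback; not real OpenMM code.
    if (useCutoff && numTiles > 0 && neighborListCanGrow)
        queue.enqueueReadBuffer(interactionCount, CL_FALSE, 0, sizeof(unsigned int),
                                pinnedCountMemory, NULL, &downloadCountEvent);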

@philipturner
Contributor

philipturner commented Feb 1, 2023

Can this be known ahead of time in more situations? If so, wrapping the enqueueReadBuffer in an if statement would be a good first step.

Is this also the case in Reaction Field, the primary benchmark I'm measuring progress with?

Can this be known ahead of time in more situations? If so, wrapping the enqueueReadBuffer in an if statement would be a good first step.

Before doing this, we should figure out what's happening with AMOEBA PME.

@peastman
Member Author

peastman commented Feb 1, 2023

Unfortunately, it can never be known ahead of time. It also monitors the number of interactions for clues about when it should reorder the molecules. That's important for performance in certain situations.

@philipturner
Contributor

@egallicc would you mind comparing power consumption to ns/day on HIP like I did, but without erasing downloadCountEvent? Then we can investigate whether HIP reached a genuine 1000 ns/day on RX 6750 XT, by comparing Δns/day to ΔW.

@ex-rzr
Contributor

ex-rzr commented Feb 1, 2023

Are you sure that the gap is not caused by interactingTiles.resize and interactingAtoms.resize? Do you see it in other iterations?

@bdenhollander
Contributor

The download causes about a 25% slowdown, even though it's supposed to happen asynchronously and nothing ever depends on it having completed.

The C++ call is non-blocking but the GPU still processes the command queue in order. I think the GPU waits for the data transfer to complete before resuming compute on the next kernel.
[screenshot: profiler timeline]

@philipturner
Contributor

philipturner commented Feb 1, 2023

Is this the data transfer from GPU -> CPU? Perhaps it's avoidable using indirect command buffers in Metal, HIP, CUDA, or OpenCL 2.0: have the GPU encode any commands that depend on the value of pinnedCountMemory and never request it CPU-side.

Also, how large is the buffer? It might take significant time over PCIe. If this hypothesis is correct, the gap you highlighted won't appear on M1 Pro. We could also profile how long it takes the CPU to encode kernels (very expensive).
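
Host-side encoding cost is easy to measure in isolation; here is a sketch with std::chrono around a single enqueue call (the header name for the C++ bindings may differ depending on the OpenCL SDK):

    #include <CL/opencl.hpp>
    #include <chrono>
    #include <cstdio>

    // Times only the host-side cost of issuing one kernel launch; the enqueue
    // call returns long before the GPU has executed anything.
    void timeEnqueue(cl::CommandQueue& queue, cl::Kernel& kernel, size_t globalSize, size_t localSize) {
        auto t0 = std::chrono::steady_clock::now();
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(globalSize), cl::NDRange(localSize));
        auto t1 = std::chrono::steady_clock::now();
        std::printf("host-side enqueue: %.1f us\n",
                    std::chrono::duration<double, std::micro>(t1 - t0).count());
    }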

@ex-rzr
Contributor

ex-rzr commented Feb 1, 2023

Also how large is the buffer?

1 uint

What happens if pinnedCountMemory is regular memory (new or malloc) instead of this?

    pinnedCountBuffer = new cl::Buffer(context.getContext(), CL_MEM_ALLOC_HOST_PTR, sizeof(unsigned int));
    pinnedCountMemory = (unsigned int*) context.getQueue().enqueueMapBuffer(*pinnedCountBuffer, CL_TRUE, CL_MAP_READ, 0, sizeof(int));
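
i.e. roughly this alternative (a sketch; whether the driver then inserts an extra staging copy for the non-pinned pointer is exactly the open question):

    // Plain host memory instead of a pinned, mapped CL_MEM_ALLOC_HOST_PTR buffer.
    pinnedCountMemory = new unsigned int;
    ...
    // The non-blocking readback then targets the regular allocation directly.
    queue.enqueueReadBuffer(interactionCount, CL_FALSE, 0, sizeof(unsigned int),
                            pinnedCountMemory, NULL, &downloadCountEvent);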

@ex-rzr
Contributor

ex-rzr commented Feb 1, 2023

gbsa on HIP, MI210: the gap between findBlocksWithInteractions and computeBornSum is about 30 us (between the memcopy and computeBornSum, about 10 us).

[screenshot: HIP profiler timeline]

Unless it's the first iteration where OpenMM loads modules with kernels:
[screenshot: HIP profiler timeline, first iteration]

@philipturner
Contributor

OpenCL is encoding the findBlocksWithInteractions kernel, issuing an event right after, then immediately requesting everything to finish. You're stalling the command stream just to read one value. With Metal that would incur a 150+ us penalty.

@ex-rzr
Contributor

ex-rzr commented Feb 1, 2023

What do you mean by "encoding"?

@philipturner
Contributor

I mean issuing the command: making calls into a GPGPU API that the GPU will then execute. I think my analysis was slightly wrong; the OpenCL CPU-side code encodes the force kernel before waiting. That shouldn't starve the GPU unless the driver refuses to flush queued-up commands when you ask it to wait on an event. We could try forcing a clFlush before waiting.

@philipturner
Contributor

Perhaps you could move this entire function onto the GPU (it seems parallelizable). Have the GPU write certain buffers, which it then reads in another kernel. You have persistent GPU threads, so perhaps you don't need an advanced indirect-dispatch API to specify the dispatch parameters.

template <class Real, class Real4, class Mixed, class Mixed4>
void ComputeContext::reorderAtomsImpl() {
    // Find the range of positions and the number of bins along each axis.
    vector<Real4> oldPosq(paddedNumAtoms);
    vector<Real4> oldPosqCorrection(paddedNumAtoms);
    vector<Mixed4> oldVelm(paddedNumAtoms);
    getPosq().download(oldPosq);
    getVelm().download(oldVelm);
    if (getUseMixedPrecision())
        getPosqCorrection().download(oldPosqCorrection);
    Real minx = oldPosq[0].x, maxx = oldPosq[0].x;
    Real miny = oldPosq[0].y, maxy = oldPosq[0].y;
    Real minz = oldPosq[0].z, maxz = oldPosq[0].z;
    Vec3 periodicBoxX, periodicBoxY, periodicBoxZ;
    getPeriodicBoxVectors(periodicBoxX, periodicBoxY, periodicBoxZ);
    Vec3 invPeriodicBoxSize(1.0/periodicBoxX[0], 1.0/periodicBoxY[1], 1.0/periodicBoxZ[2]);
    if (getNonbondedUtilities().getUsePeriodic()) {
        minx = miny = minz = 0.0;
        maxx = periodicBoxX[0];
        maxy = periodicBoxY[1];
        maxz = periodicBoxZ[2];
    }
    else {
        for (int i = 1; i < numAtoms; i++) {
            const Real4& pos = oldPosq[i];
            minx = min(minx, pos.x);
            maxx = max(maxx, pos.x);
            miny = min(miny, pos.y);
            maxy = max(maxy, pos.y);
            minz = min(minz, pos.z);
            maxz = max(maxz, pos.z);
        }
    }
    // Loop over each group of identical molecules and reorder them.
    vector<int> originalIndex(numAtoms);
    vector<Real4> newPosq(paddedNumAtoms, Real4(0,0,0,0));
    vector<Real4> newPosqCorrection(paddedNumAtoms, Real4(0,0,0,0));
    vector<Mixed4> newVelm(paddedNumAtoms, Mixed4(0,0,0,0));
    vector<mm_int4> newCellOffsets(numAtoms);
    for (auto& mol : moleculeGroups) {
        // Find the center of each molecule.
        int numMolecules = mol.offsets.size();
        vector<int>& atoms = mol.atoms;
        vector<Real4> molPos(numMolecules);
        Real invNumAtoms = (Real) (1.0/atoms.size());
        for (int i = 0; i < numMolecules; i++) {
            molPos[i].x = 0.0f;
            molPos[i].y = 0.0f;
            molPos[i].z = 0.0f;
            for (int j = 0; j < (int)atoms.size(); j++) {
                int atom = atoms[j]+mol.offsets[i];
                const Real4& pos = oldPosq[atom];
                molPos[i].x += pos.x;
                molPos[i].y += pos.y;
                molPos[i].z += pos.z;
            }
            molPos[i].x *= invNumAtoms;
            molPos[i].y *= invNumAtoms;
            molPos[i].z *= invNumAtoms;
            if (molPos[i].x != molPos[i].x)
                throw OpenMMException("Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan");
        }
        if (getNonbondedUtilities().getUsePeriodic()) {
            // Move each molecule position into the same box.
            for (int i = 0; i < numMolecules; i++) {
                Real4 center = molPos[i];
                int zcell = (int) floor(center.z*invPeriodicBoxSize[2]);
                center.x -= zcell*periodicBoxZ[0];
                center.y -= zcell*periodicBoxZ[1];
                center.z -= zcell*periodicBoxZ[2];
                int ycell = (int) floor(center.y*invPeriodicBoxSize[1]);
                center.x -= ycell*periodicBoxY[0];
                center.y -= ycell*periodicBoxY[1];
                int xcell = (int) floor(center.x*invPeriodicBoxSize[0]);
                center.x -= xcell*periodicBoxX[0];
                if (xcell != 0 || ycell != 0 || zcell != 0) {
                    Real dx = molPos[i].x-center.x;
                    Real dy = molPos[i].y-center.y;
                    Real dz = molPos[i].z-center.z;
                    molPos[i] = center;
                    for (int j = 0; j < (int) atoms.size(); j++) {
                        int atom = atoms[j]+mol.offsets[i];
                        Real4 p = oldPosq[atom];
                        p.x -= dx;
                        p.y -= dy;
                        p.z -= dz;
                        oldPosq[atom] = p;
                        posCellOffsets[atom].x -= xcell;
                        posCellOffsets[atom].y -= ycell;
                        posCellOffsets[atom].z -= zcell;
                    }
                }
            }
        }
        // Select a bin for each molecule, then sort them by bin.
        bool useHilbert = (numMolecules > 5000 || atoms.size() > 8); // For small systems, a simple zigzag curve works better than a Hilbert curve.
        Real binWidth;
        if (useHilbert)
            binWidth = (Real) (max(max(maxx-minx, maxy-miny), maxz-minz)/255.0);
        else
            binWidth = (Real) (0.2*getNonbondedUtilities().getMaxCutoffDistance());
        Real invBinWidth = (Real) (1.0/binWidth);
        int xbins = 1 + (int) ((maxx-minx)*invBinWidth);
        int ybins = 1 + (int) ((maxy-miny)*invBinWidth);
        vector<pair<int, int> > molBins(numMolecules);
        bitmask_t coords[3];
        for (int i = 0; i < numMolecules; i++) {
            int x = (int) ((molPos[i].x-minx)*invBinWidth);
            int y = (int) ((molPos[i].y-miny)*invBinWidth);
            int z = (int) ((molPos[i].z-minz)*invBinWidth);
            int bin;
            if (useHilbert) {
                coords[0] = x;
                coords[1] = y;
                coords[2] = z;
                bin = (int) hilbert_c2i(3, 8, coords);
            }
            else {
                int yodd = y&1;
                int zodd = z&1;
                bin = z*xbins*ybins;
                bin += (zodd ? ybins-y : y)*xbins;
                bin += (yodd ? xbins-x : x);
            }
            molBins[i] = pair<int, int>(bin, i);
        }
        sort(molBins.begin(), molBins.end());
        // Reorder the atoms.
        for (int i = 0; i < numMolecules; i++) {
            for (int atom : atoms) {
                int oldIndex = mol.offsets[molBins[i].second]+atom;
                int newIndex = mol.offsets[i]+atom;
                originalIndex[newIndex] = atomIndex[oldIndex];
                newPosq[newIndex] = oldPosq[oldIndex];
                if (getUseMixedPrecision())
                    newPosqCorrection[newIndex] = oldPosqCorrection[oldIndex];
                newVelm[newIndex] = oldVelm[oldIndex];
                newCellOffsets[newIndex] = posCellOffsets[oldIndex];
            }
        }
    }
    // Update the arrays.
    ContextSelector selector(*this);
    for (int i = 0; i < numAtoms; i++) {
        atomIndex[i] = originalIndex[i];
        posCellOffsets[i] = newCellOffsets[i];
    }
    getPosq().upload(newPosq);
    if (getUseMixedPrecision())
        getPosqCorrection().upload(newPosqCorrection);
    getVelm().upload(newVelm);
    getAtomIndexArray().upload(atomIndex);
    for (auto listener : reorderListeners)
        listener->execute();
}

This removes the need to break the command stream. If you don't want to find interactions, simply return early inside the appropriate kernel.

@egallicc
Contributor

egallicc commented Feb 1, 2023

@egallicc would you mind comparing power consumption to ns/day on HIP like I did, but without erasing downloadCountEvent? Then we can investigate whether HIP reached a genuine 1000 ns/day on RX 6750 XT, by comparing Δns/day to ΔW.

I will try, but not before the weekend. Teaching has resumed. :( I wouldn't want to be the bottleneck on this. Would it help if I shared my Docker image for building OpenMM with HIP on Linux?

@philipturner
Contributor

philipturner commented Feb 1, 2023

I will try but not before the weekend. Teaching has resumed. :( I wouldn't want to be the bottleneck on this. Would it help if I shared my docker image to build OpenMM with HIP on Linux?

I don't have HIP - I have an Apple GPU - but I meant "compare changes in the same style I did". The change here is using OpenCL vs. using HIP. Maybe someone else with an AMD card could install Linux?

@philipturner
Contributor

I found a solution. The following boosted apoa1rf from 115 ns/day to 129 ns/day.

void MetalNonbondedUtilities::computeInteractions(int forceGroups, bool includeForces, bool includeEnergy) {
    if ((forceGroups&groupFlags) == 0)
        return;
    KernelSet& kernels = groupKernels[forceGroups];
    if (kernels.hasForces) {
        // ADD THIS LINE TO EXISTING CODE
        if (useCutoff && numTiles > 0)
            context.getQueue().flush();
        
        cl::Kernel& kernel = (includeForces ? (includeEnergy ? kernels.forceEnergyKernel : kernels.forceKernel) : kernels.energyKernel);
        if (*reinterpret_cast<cl_kernel*>(&kernel) == NULL)
            kernel = createInteractionKernel(kernels.source, parameters, arguments, true, true, forceGroups, includeForces, includeEnergy);
        if (useCutoff)
            setPeriodicBoxArgs(context, kernel, 9);
        context.executeKernel(kernel, numForceThreadBlocks*forceThreadBlockSize, forceThreadBlockSize);
    }
    if (useCutoff && numTiles > 0) {
        // ADD THIS LINE TO EXISTING CODE
        if (kernels.hasForces)
            context.getQueue().flush();
        
        downloadCountEvent.wait();
        updateNeighborListSize();
    }
}

@peastman
Member Author

peastman commented Feb 1, 2023

Interesting! I have no idea why that would make a difference. Here's what I see running the PME benchmark on a few different GPUs. On each one I tried four versions: the original code, removing the call to wait(), removing both the download and the wait, and keeping everything but adding the two calls to flush().

RTX 4080: It mostly makes no difference. Almost all versions are a steady 1424 ns/day. The only exception is the version where I removed the download, which was slightly slower but also seemed rather noisy. I suspect those runs were slower for an unrelated reason.

M1 Pro: The original was 130 ns/day. All the others were 147. The slowdown comes from the wait(), but the flush() removes it.

Radeon RX 5700 XT: The original is 320 ns/day. As posted above, skipping the wait speeds it up a little to 340, and skipping the download speeds it up a lot to 406. Adding flush() is almost as fast, 400 ns/day.

It looks like we may have a good optimization here. I just wish I understood why it worked! It feels a little like black magic.

@philipturner
Contributor

I totally agree. This counterproductive behavior by all the hardware vendors is hurting progress and shows little respect for people's development efforts.

I would say that SYCL is the first universal, coordinated attempt to solve this problem. Khronos finally got it right, unlike with OpenCL. The issue is that such a solution requires so much engineering effort that it's still in its infancy. Up until recently only Intel sponsored it, so it got lumped in with oneAPI. Most people think SYCL is just "another CUDA" or "another HIP", but that's inaccurate.

@peastman
Member Author

peastman commented Feb 6, 2023

Keep in mind that our resources are very limited. As in, I'm the only developer who works full time on OpenMM! I'm also the only developer who has been around for anything close to the full life of the project. It's not just about how to get the fastest performance on a particular GPU right now. It's about how to create a codebase that's easy to maintain and easy to add new features to. It's also about building on technologies that are around for the long term. And it's about how to prioritize effort to best serve our users.

Here's the reality we have to work with: the vast majority of production simulations are run on NVIDIA GPUs. That makes them more important than all other platforms put together. That's not a reality I'm very happy with. I want there to be more competition. I really want AMD to get serious about HPC. But I also can't afford to dedicate too much time to it, and I certainly can't do anything that would add a major ongoing maintenance burden. Writing a new platform is a lot of work, but maintaining it long term is much more work.

I've also been burned by AMD repeatedly. The very first implementation of OpenMM ran on AMD GPUs. It was written in Brook+, which was the language they were promoting for GPGPU at the time. Then they abandoned it and switched to OpenCL, which they declared was the hot new thing they were 100% committed to. And they were... right until they suddenly declared that HIP was the new hot new thing they were 100% committed to. But maybe we should instead be using SYCL, which is the even newer new hot new thing. Through all of this, AMD has never shown any sign of really being committed to HPC, or even of understanding it. Example: their compute profiler doesn't even work on Linux. Example: instead of continuing to invest in OpenCL, a cross platform standard that lots of developers have put a lot of work into supporting, they decide the whole community needs to rewrite their code to support a new proprietary API.

Sorry for complaining. :) It's hard not to be bitter when you've been through too many rounds of this. And AMD is hardly alone in this. It was Apple that originally developed OpenCL. That was back when they were still struggling to compete with Microsoft and were trying to convince developers to use open standards instead of DirectX. Once the iPhone became a dominant platform, they suddenly decided they didn't like open standards anymore and pushed everyone to use their proprietary APIs. Intel also used to be 100% dedicated to OpenCL, until they decided to create oneAPI instead.

The reason NVIDIA has come to dominate the market is that they're the only ones who've been consistent over the long term. CUDA was introduced back in 2007 and they've stuck with it ever since. They also have consistently made good hardware that was well suited to HPC. Unlike, say, the Xeon Phi monstrosity Intel tried to dump on us.

@philipturner
Contributor

Maybe we can just consider it hypothetically. I'm not saying we have to implement a backend. It's also a good idea to wait and see if it really sticks. By the time it does, OpenCL might really start showing its wrinkles. The US recently banned exports of A100s and H100s to China and crippled their semiconductor production capabilities. Simple reasoning suggests they'll rise back up with custom Chinese accelerators in ~5-10 years (as @DanielWicz mentioned). Will that fresh hardware support then-ancient standards like OpenGL and OpenCL?

Right now I can maintain the Metal backend, maybe for 5 years, but not for 20. That's the same timeframe in which we might see SYCL become established. In any case it's good to prepare for the future and anticipate how it might impact OpenMM.

@peastman
Member Author

peastman commented Feb 6, 2023

I'm certainly hoping for things to swing back toward standards again!

@peastman
Member Author

peastman commented Feb 6, 2023

Back to the original subject, it looks like the effect of adding the flush() calls is all over the place. Just looking at @ex-rzr's results, each of the three GPUs has a different pattern. On MI210 it consistently makes all benchmarks slower. On MI100 and V620 it makes some benchmarks faster and others slower, but not the same ones. For example, on MI100 it makes the GBSA benchmark faster, but on V620 it makes that benchmark slower.

So for AMD, this probably isn't a useful optimization. If we can figure out why it produces a speedup in some cases and a slowdown in others, perhaps we can find a better approach that consistently makes it faster (or at least not slower).

For Apple I think it could be useful. I believe it always produces a speedup on all computers we've tested? If so, we could perform the flushes on Apple GPUs only.
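
If the flush does end up being applied selectively, the gating could look roughly like this (a sketch only; the vendor-string matching is illustrative, and the change that was eventually merged, referenced later in this thread, gates it with an isAMD-style check instead):

    // Decide once, at context creation, whether to use the extra flush.
    std::string vendor = device.getInfo<CL_DEVICE_VENDOR>();
    bool useExtraFlush = (vendor.find("Apple") != std::string::npos ||
                          vendor.find("Advanced Micro Devices") != std::string::npos ||
                          vendor.find("AMD") != std::string::npos);
    ...
    if (useExtraFlush)
        queue.flush();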

@egallicc
Contributor

egallicc commented Feb 6, 2023

Keep in mind that our resources are very limited.

Is limiting "internal" development to the Common and OpenCL platforms a realistic possibility?

I've also been burned by AMD repeatedly.

Exactly. Hardware vendors made the mess and should, ideally, own the consequences as well, meaning they should put in the resources to support their hardware with their own language inventions.

But, to be fair, in this case AMD cleaned up the mess a bit. The HIP work appears very good, judging only by the performance and ease of installation. The question is how much additional work it has cost you to interact with them and how much more work it will take to fully incorporate their implementation into OpenMM.

@philipturner
Contributor

Is limiting "internal" development to the Common and OpenCL platforms a realistic possibility?

What if the community converged and made a SYCL backend as a plugin? HIP, Metal, and others would all be able to use the backend.

@DanielWicz

DanielWicz commented Feb 7, 2023

There's a table from my computing center (CINES) that nicely summarizes all the frameworks:
[image: framework comparison table from CINES]

@bdenhollander
Contributor

@ex-rzr Thank you for the benchmarks of the three-flush version. Would you also be able to run benchmarks with only one flush enabled? First try only before forcesKernel (Perm 1), and then only after clEnqueueReadBuffer (Perm 5). The extra pauses introduced by multiple flushes may be cancelling out the gains from a beneficial flush.

@ex-rzr
Contributor

ex-rzr commented Feb 8, 2023

MI100

8.0.0

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     750.309
rf                single      HBonds        1.5            4         NVT        OpenCL     520.171
pme               single      HBonds        1.5            4         NVT        OpenCL     299.185
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     167.932
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     111.471
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     92.7899
amoebagk          single      None          1              2         NVT        OpenCL     2.02475
amoebapme         single      None          1              2         NVT        OpenCL     9.03818
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     339.465
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     25.1879
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     8.13449

before forcesKernel (Perm 1)

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     770.383
rf                single      HBonds        1.5            4         NVT        OpenCL     533.668
pme               single      HBonds        1.5            4         NVT        OpenCL     306.073
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     170.581
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     111.059
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     91.6098
amoebagk          single      None          1              2         NVT        OpenCL     2.02818
amoebapme         single      None          1              2         NVT        OpenCL     9.04325
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     342.845
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     28.4718
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     8.11413

after clEnqueueReadBuffer (Perm 5)

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     773.396
rf                single      HBonds        1.5            4         NVT        OpenCL     541.363
pme               single      HBonds        1.5            4         NVT        OpenCL     303.659
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     171.381
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     113.235
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     93.8418
amoebagk          single      None          1              2         NVT        OpenCL     2.02443
amoebapme         single      None          1              2         NVT        OpenCL     9.01838
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     348.567
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     25.1565
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     8.12385

MI210

8.0.0

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     801.239
rf                single      HBonds        1.5            4         NVT        OpenCL     556.119
pme               single      HBonds        1.5            4         NVT        OpenCL     330.539
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     185.013
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     122.989
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     97.3263
amoebagk          single      None          1              2         NVT        OpenCL     2.3329
amoebapme         single      None          1              2         NVT        OpenCL     9.84983
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     363.889
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     25.9868
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     7.9088

before forcesKernel (Perm 1)

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     788.712
rf                single      HBonds        1.5            4         NVT        OpenCL     556.46
pme               single      HBonds        1.5            4         NVT        OpenCL     323.952
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     183.917
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     121.757
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     97.4903
amoebagk          single      None          1              2         NVT        OpenCL     2.33588
amoebapme         single      None          1              2         NVT        OpenCL     9.87685
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     358.072
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     26.1034
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     7.90336

after clEnqueueReadBuffer (Perm 5)

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     802.347
rf                single      HBonds        1.5            4         NVT        OpenCL     561.416
pme               single      HBonds        1.5            4         NVT        OpenCL     324.969
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     185.25
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     124.16
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     97.9939
amoebagk          single      None          1              2         NVT        OpenCL     2.33307
amoebapme         single      None          1              2         NVT        OpenCL     9.82599
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     362.277
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     26.1178
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     7.9478

V620

8.0.0

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     885.712
rf                single      HBonds        1.5            4         NVT        OpenCL     778.101
pme               single      HBonds        1.5            4         NVT        OpenCL     326.576
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     303.421
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     155.245
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     110.594
amoebagk          single      None          1              2         NVT        OpenCL     2.08787
amoebapme         single      None          1              2         NVT        OpenCL     10.5296
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     356.601
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     34.0437
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     9.74731

before forcesKernel (Perm 1)

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     850.839
rf                single      HBonds        1.5            4         NVT        OpenCL     745.627
pme               single      HBonds        1.5            4         NVT        OpenCL     364.235
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     318.617
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     154.077
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     109.724
amoebagk          single      None          1              2         NVT        OpenCL     2.09071
amoebapme         single      None          1              2         NVT        OpenCL     10.5216
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     379.88
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     38.3378
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     9.71824

after clEnqueueReadBuffer (Perm 5)

Test              Precision   Constraints   H mass (amu)   dt (fs)   Ensemble   Platform   ns/day
gbsa              single      HBonds        1.5            4         NVT        OpenCL     872.109
rf                single      HBonds        1.5            4         NVT        OpenCL     792.096
pme               single      HBonds        1.5            4         NVT        OpenCL     334.976
apoa1rf           single      HBonds        1.5            4         NVT        OpenCL     307.479
apoa1pme          single      HBonds        1.5            4         NVT        OpenCL     160.66
apoa1ljpme        single      HBonds        1.5            4         NVT        OpenCL     114.151
amoebagk          single      None          1              2         NVT        OpenCL     2.08753
amoebapme         single      None          1              2         NVT        OpenCL     10.674
amber20-dhfr      single      HBonds        1.5            4         NVT        OpenCL     361.81
amber20-cellulose single      HBonds        1.5            4         NVT        OpenCL     34.3507
amber20-stmv      single      HBonds        1.5            4         NVT        OpenCL     9.74555

@philipturner
Contributor

MI100              Original   Perm 1   Perm 5   Δ Perm 1   Δ Perm 5
gbsa               750.3      770.4    773.4    2.68%      3.08%
rf                 520.2      533.7    541.4    2.60%      4.08%
pme                299.2      306.1    303.7    2.31%      1.50%
apoa1rf            167.9      170.6    171.4    1.61%      2.08%
apoa1pme           111.5      111.1    113.2    -0.36%     1.52%
apoa1ljpme         92.8       91.6     92.8     -1.29%     0.00%
amoebagk           2.025      2.028    2.024    0.15%      -0.05%
amoebapme          9.038      9.043    9.018    0.06%      -0.22%
amber20-dhfr       339.5      342.8    348.6    0.97%      2.68%
amber20-cellulose  25.19      28.47    25.16    13.02%     -0.12%
amber20-stmv       8.134      8.114    8.124    -0.25%     -0.12%

MI210              Original   Perm 1   Perm 5   Δ Perm 1   Δ Perm 5
gbsa               801.2      788.8    802.3    -1.55%     0.14%
rf                 556.1      556.5    561.4    0.07%      0.95%
pme                330.5      324.0    325.0    -1.97%     -1.66%
apoa1rf            185.0      183.9    185.3    -0.59%     0.16%
apoa1pme           123.0      121.8    124.2    -0.98%     0.98%
apoa1ljpme         97.3       97.49    97.99    0.20%      0.71%
amoebagk           2.333      2.336    2.333    0.13%      0.00%
amoebapme          9.850      9.877    9.826    0.27%      -0.24%
amber20-dhfr       363.9      358.1    362.3    -1.59%     -0.44%
amber20-cellulose  25.99      26.10    26.12    0.42%      0.50%
amber20-stmv       7.909      7.903    7.948    -0.08%     0.49%

V620               Original   Perm 1   Perm 5   Δ Perm 1   Δ Perm 5
gbsa               885.7      850.8    872.1    -3.94%     -1.54%
rf                 778.1      745.6    792.1    -4.18%     1.80%
pme                326.6      364.2    335.0    11.51%     2.57%
apoa1rf            303.4      318.6    307.5    5.01%      1.35%
apoa1pme           155.2      154.1    160.7    -0.71%     3.54%
apoa1ljpme         110.6      109.7    114.2    -0.81%     3.25%
amoebagk           2.088      2.091    2.088    0.14%      0.00%
amoebapme          10.53      10.52    10.67    -0.09%     1.33%
amber20-dhfr       356.6      379.9    361.8    6.53%      1.46%
amber20-cellulose  34.04      38.34    34.35    12.63%     0.91%
amber20-stmv       9.747      9.718    9.746    -0.30%     -0.01%

Perhaps try the optimization here?

@peastman
Member Author

peastman commented Feb 8, 2023

I used the changes in #3954 to profile on AMD. Here's a timeline for PME with the original code.

[screenshot: PME timeline, original code]

There are gaps of 50-60 us after computeNonbonded and findBlocksWithInteractions. The timeline makes it look like there are also gaps after finishSpreadCharge and reciprocalConvolution, but there aren't really. Those are the FFTs. It only captures kernels launched through OpenCLContext, so it doesn't show the ones launched by VkFFT.

Here's what it looks like with a flush() immediately before launching computeNonbonded.

[screenshot: PME timeline with flush() before computeNonbonded]

There's no longer a significant gap after computeNonbonded, but instead we have a gap of a similar size immediately before it. So why is it faster? According to the profile, it isn't. It shows the time per step as basically unchanged at 1.1 ms. Which suggests that profiling is distorting the results and changing the thing I'm trying to measure. :(

Here's the timeline for apoa1pme (original code).

[screenshot: apoa1pme timeline, original code]

The same gaps are there as before. The one after findBlocksWithInteractions is about the same size as before, which makes it less significant in relation to the length of the kernels. The one after computeNonbonded has grown roughly in proportion to how long the kernel execution has grown, so it's still a similar fraction of the time.

And here it is with flush().

[screenshot: apoa1pme timeline with flush()]

No significant change to much of anything.

@bdenhollander
Contributor

@peastman This is really cool what you've put together!

You may get a better idea of what's going on if you increase the precision of the start-time-to-string conversion to 8 digits.

6 digits says these started simultaneously:

{ "pid":1, "tid":1, "ts":9.56276e+06, "dur":0.28, "ph":"X", "name":"applyShakeToVelocities" },
{ "pid":1, "tid":1, "ts":9.56276e+06, "dur":0.28, "ph":"X", "name":"calcCenterOfMassMomentum" },
{ "pid":1, "tid":1, "ts":9.56276e+06, "dur":11.04, "ph":"X", "name":"removeCenterOfMassMomentum" },

8 digits shows them 2-3 us apart:

{ "pid":1, "tid":1, "ts":11581297, "dur":0.28, "ph":"X", "name":"applyShakeToVelocities" },
{ "pid":1, "tid":1, "ts":11581300, "dur":0.28, "ph":"X", "name":"calcCenterOfMassMomentum" },
{ "pid":1, "tid":1, "ts":11581302, "dur":18.52, "ph":"X", "name":"removeCenterOfMassMomentum" },

@philipturner
Contributor

Which suggests that profiling is distorting the results and changing the thing I'm trying to measure.

With CL_QUEUE_PROFILING_ENABLE, the queue flushes after every single OpenCL API call. You would never be able to measure the effect of flush() that way.

@peastman
Member Author

peastman commented Feb 9, 2023

With CL_QUEUE_PROFILING_ENABLE, the queue flushes after every single OpenCL API call.

Where did you find that?

@philipturner
Contributor

Where did you find that?

I mentioned it a while ago on the "future of osx GPU support" thread. That's why, initially, when I didn't know how the Mac OpenCL driver worked, I was worried. My hypothesis was that it flushed after every command, causing severe CPU-side overhead, and that switching to Metal would make the overhead preventable. Recently, you were there when I learned the Mac OpenCL driver does not do that. The driver does not flush after every command.

That is, unless you enable profiling. The only way to accomplish that is to submit each command in its own command buffer and time the beginning and end of that command buffer. Creating a new command buffer is by definition flushing the queue. This limitation is universal across GPU vendors, including AMD (see the linked page; Cmd+F for "Enabling profiling").
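
For context, this is the OpenCL profiling mechanism in question: the queue is created with CL_QUEUE_PROFILING_ENABLE and per-command timestamps are read from events afterwards (a sketch using the C++ bindings; how the driver implements this internally, e.g. whether it splits commands into separate hardware submissions, is exactly what's being debated here):

    // Queue created with profiling enabled; timestamps queried per event.
    cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE);
    cl::Event event;
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(globalSize), cl::NullRange,
                               NULL, &event);
    event.wait();
    cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();   // nanoseconds
    cl_ulong end = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double kernelMs = (end - start)*1e-6;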

@peastman
Member Author

peastman commented Feb 9, 2023

If that hypothesis were true, then adding the extra flush() call would have no effect when profiling is enabled. In fact, it still produces a speedup. It's only when we create the extra events and wait on them that the benefit goes away.

You're making assumptions about how the GPU works: that profiling is implemented entirely on the host side. In fact, many GPUs have hardware support for profiling and can automatically collect lots of metrics. For those GPUs, the profiling API is just a way of querying the data collected by the GPU. It need not have any impact on how the host dispatches work to it.

bdenhollander added a commit to bdenhollander/openmm that referenced this issue Feb 20, 2023
- Implements perm 5 from openmm#3937 (comment)

Co-Authored-By: Philip Turner <philipturner.AR@gmail.com>
@bdenhollander
Contributor

Linux flush test results on an RX 6600. Perm 3 and 5 provide the most consistent outcomes.
Windows results favouring Perm 5 and 1 are here.

ns/day    gbsa    rf      pme     apoa1rf   apoa1pme   apoa1ljpme
Default   525.6   563.7   292.9   132.9     79.3       60.7
Perm 1    547.5   569.2   310.8   138.1     74.8       56.8
Perm 2    544.3   562.5   321.8   137.6     72.9       56.6
Perm 3    538.4   576.9   315.9   137.0     78.4       60.8
Perm 4    548.1   561.1   322.0   139.9     72.8       57.1
Perm 5    536.3   577.8   315.1   136.9     78.2       60.7

[chart: ns/day by flush permutation]

          gbsa    rf       pme     apoa1rf   apoa1pme   apoa1ljpme
Perm 1    4.15%   0.97%    6.08%   3.91%     -5.67%     -6.38%
Perm 2    3.55%   -0.21%   9.86%   3.50%     -8.05%     -6.73%
Perm 3    2.42%   2.34%    7.85%   3.04%     -1.14%     0.29%
Perm 4    4.27%   -0.45%   9.90%   5.24%     -8.23%     -5.81%
Perm 5    2.04%   2.51%    7.57%   2.99%     -1.38%     0.04%

[chart: percent change by flush permutation]

peastman pushed a commit that referenced this issue Oct 24, 2023
* Enable flush on Windows

- Implements perm 5 from #3937 (comment)

Co-Authored-By: Philip Turner <philipturner.AR@gmail.com>

* Add brackets for clarification

Co-authored-by: Philip Turner <philipturner.AR@gmail.com>

* Make this optimization only apply to AMD GPUs

* Switch to perm 1

- Flush before call to computeNonbonded since it works well on Windows and Linux

* Update OpenCLNonbondedUtilities.cpp

* Perm 4 is now significantly faster on Windows

* Use isAMD

* Fix indentation

* Fix missed variable

* Remove Mac check

* Remove isAMD out of Mac code

* Consistent (lack of) brackets style

---------

Co-authored-by: Philip Turner <philipturner.AR@gmail.com>
@bdenhollander
Contributor

We've made some substantial gains in OpenCL performance on AMD over the past 10 months! Thank you @philipturner for bringing great ideas and @peastman for your assistance implementing them.

Test               OpenMM 8.0 ns/day   OpenMM 8.1 ns/day   Change
gbsa               466.871             660.375             41.4%
rf                 471.553             584.512             24.0%
pme                311.9               440.685             41.3%
apoa1rf            147.405             164.656             11.7%
apoa1pme           89.5964             109.023             21.7%
apoa1ljpme         68.706              82.2956             19.8%
amoebagk           1.08652             1.08781             0.1%
amoebapme          0.952845            1.10331             15.8%
amber20-dhfr       327.126             478.553             46.3%
amber20-cellulose  17.9677             25.0283             39.3%
amber20-stmv       4.29275             7.27965             69.6%

[chart: OpenMM 8.0 vs 8.1 benchmarks]

@peastman
Member Author

peastman commented Nov 3, 2023

That's excellent! Thanks for pushing on this. It's good to see we've made real progress.

@philipturner
Contributor

This is also great because I have an AMD 7900 XTX; I'm just waiting until I need to run a larger simulation before I get the Linux box wired up.
