Optimize multi-GPU communication #3333

Closed
aizvorski opened this issue Nov 16, 2021 · 16 comments

@aizvorski
Contributor

aizvorski commented Nov 16, 2021

When using multi-GPU, positions get copied from one GPU to all GPUs, and after the force kernels run, the forces get copied back and summed on one GPU. This seems like a perfect use for the very fast ncclBroadcast() and ncclReduce() calls in the NCCL library.

Looking at the code, it seems most of the work now happens here:

// Copy coordinates over to each device and execute the kernel.
if (!cu.getPlatformData().peerAccessSupported) {
    cu.getPosq().download(pinnedPositionBuffer, false);
    cuEventRecord(event, cu.getCurrentStream());
}
else {
    int numBytes = cu.getPosq().getSize()*cu.getPosq().getElementSize();
    cuEventRecord(event, cu.getCurrentStream());
    cuStreamWaitEvent(peerCopyStream, event, 0);
    for (int i = 1; i < (int) data.contexts.size(); i++)
        CHECK_RESULT(cuMemcpyAsync(data.contexts[i]->getPosq().getDevicePointer(), cu.getPosq().getDevicePointer(), numBytes, peerCopyStream), "Error copying positions");
    cuEventRecord(event, peerCopyStream);
}

and

// Sum the forces from all devices.
CudaContext& cu = *data.contexts[0];
ContextSelector selector(cu);
if (!cu.getPlatformData().peerAccessSupported)
    contextForces.upload(pinnedForceBuffer, false);
int bufferSize = 3*cu.getPaddedNumAtoms();
int numBuffers = data.contexts.size()-1;
void* args[] = {&cu.getForce().getDevicePointer(), &contextForces.getDevicePointer(), &bufferSize, &numBuffers};
cu.executeKernel(sumKernel, args, bufferSize);

It seems there are two code paths depending on whether peerAccessSupported is true or false. If it's false, there is a call to CudaArray.download() for the inputs and then CudaArray.upload() for the outputs. If it's true, the data is accessed using unified addressing from one GPU to another (without an explicit copy). I think on most recent GPUs the peerAccessSupported=true code path would be used (both on systems with NVLink/NVSwitch and on plain PCIe). I'm not sure of the performance of using unified memory in this way, but it's probably not optimal (especially without prefetch). Empirically, the multi-GPU code currently reaches only ~41 GB/s on an 8x A100 system, vs. an 8x 600 GB/s hardware limit; there is only a 26% speedup going from 2 to 4 GPUs, and no speedup at all going from 4 to 8.

The NCCL calls use more sophisticated tree and ring routing algorithms that are aware of the link topology, rather than n sequential copies, and a reduce can sum the data while it is being copied. This readily reaches near-hardware-limit performance for operations on data of at least a few megabytes. So, I would like to request adding NCCL support to the multi-GPU code, and/or possibly switching over to it entirely.
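For concreteness, here is a rough sketch (not OpenMM code; the buffer, stream, and communicator names are all hypothetical) of what the two transfers could look like with NCCL, using one communicator per GPU created by ncclCommInitAll():

// posq[i]: device pointer to the positions on GPU i (float4 per atom, assuming single precision)
// force[i]: device pointer to the 64-bit fixed-point force buffer on GPU i
// comms[i], streams[i]: NCCL communicator and CUDA stream for GPU i

// Broadcast positions from GPU 0 to all GPUs (in place on the root).
ncclGroupStart();
for (int i = 0; i < numGpus; i++)
    ncclBroadcast(posq[i], posq[i], 4*numAtoms, ncclFloat, 0, comms[i], streams[i]);
ncclGroupEnd();

// ... each GPU computes its assigned subset of the interactions ...

// Sum the forces onto GPU 0; the reduction happens during the transfer.
// 64-bit fixed point sums exactly under integer addition, so ncclInt64 + ncclSum is safe.
ncclGroupStart();
for (int i = 0; i < numGpus; i++)
    ncclReduce(force[i], force[i], 3*paddedNumAtoms, ncclInt64, ncclSum, 0, comms[i], streams[i]);
ncclGroupEnd();

NCCL then picks ring or tree routes based on the topology it detects at initialization time.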

@peastman
Member

This sounds like a good case for some profiling to identify what the real bottlenecks are. The algorithm for multi-GPU is mainly designed for small numbers of GPUs, and is unlikely to ever work well for larger numbers. That would require a completely different architecture. Here's the basic algorithm:

  1. Broadcast positions from one GPU to the others.
  2. Each GPU computes an assigned subset of the interactions.
  3. Transmit forces from each GPU back to a single GPU.
  4. That single GPU performs integration.

You can see several factors that will limit scaling. Transmitting positions and forces takes time. You can't start integration until all GPUs have finished their force computations, so if one takes longer than the others, that leaves the others sitting idle. And integration is done on only one GPU, so the others are idle during that time.

I suspect the peer-to-peer copying isn't really the bottleneck. Anyway, we don't want it to be. If the code is working well the bottleneck will be in computation, not communication, so seeing the data transfer only reaching a small fraction of the theoretical limit is a good thing. But a bit of profiling would give more information about where the real bottlenecks are.

There are MD codes designed to run on large clusters and parallelize efficiently over many nodes. They're based on domain decomposition, where each node "owns" the atoms in some small region. Positions and forces only need to be communicated between nearby regions. Each node does integration for the atoms it owns, so that is done in parallel, and it can start as soon as its neighbors have finished their force computations, even if some distant regions aren't finished yet. That's probably the sort of architecture we would need to scale to large numbers of GPUs. But doing that is much more complicated.

@aizvorski
Contributor Author

@peastman Thanks! Indeed, profiling: any suggestions on how to do it? I haven't used nvprof/nsight before, but if you tell me how, I would be happy to collect some traces.

@peastman
Member

Nsight Systems is a good tool for collecting this kind of information. Here is a talk by someone at NVIDIA giving a thorough introduction to using it. There's a lot of information it can collect. You can see, for example, what each GPU is doing, when it's idle, when data is being transferred between GPUs, how long it takes, etc.

@aizvorski
Contributor Author

aizvorski commented Nov 17, 2021

@peastman First round of profiling done! Traces here: https://gist.github.com/aizvorski/833ca0a9eecd85f929ab968cf4335249 I can open these in nsight-sys and look around; hopefully you can open them as well and they have all the necessary info.

These were collected with

nsys profile --trace=cuda,cudnn,cublas,osrt,nvtx --delay=150 --duration=10 --output ${name} python benchmark.py --platform=CUDA --test=amber20-stmv --seconds=200 --device=0,1,2,3,4,5,6,7

or whatever the list of devices should be (I ran with 1, 2, 4, and 8). For a few runs I also added the unified-memory profiling flags --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true; those have "um" in the filename.

The workload is always the same: amber20-stmv, ~1M atoms using the benchmark script. Also openmm=7.6.0, cuda=11.5, nvidia drivers=495.29.05.

I ran on 4x A10G GPUs, no NVLink (AWS g5.12xlarge), and 8x V100, with NVLink (AWS p3.16xlarge). AWS has some capacity issues so no A100s yet, but I'll do those as well as soon as possible :)

@aizvorski
Contributor Author

aizvorski commented Nov 17, 2021

@peastman A representative step on 4x V100, with ~11.1ms per step, just looking at GPU 0:

sortBuckets 2.9ms
findBlocksWithInteractions 2.4ms
gridInterpolateForce 0.8ms
computeBondedForces 0.3ms
computeNonbonded 1.5ms
Memcpy PtoP(3->0) 0.5ms 25.6MB 48GB/s
Memcpy PtoP(2->0) 1.0ms 25.6MB 24GB/s
Memcpy PtoP(1->0) 1.0ms 25.6MB 24GB/s
sumForces 0.2ms
... integrator stuff
Memcpy PtoP(0->1) 0.7ms 17MB 24GB/s

Copying data takes ~30% of the time in this case.

NVLink on this system should have up to 300GB/s peer to peer, so the speeds here look like PCIe speeds, not NVLink speeds. How can I check whether NVLink is actually being used? Is there even a way to turn NVLink off if it is physically there?

Why is the data being transferred in 25.6MB blocks? Is this the forces, 1M particles * 3 dimensions * 8 bytes? If so, why double precision? (The benchmark script was run in the default single precision mode, not mixed.)

On the other hand, why is the amount of data being transferred out only 17MB? That seems low for broadcasting positions to all GPUs, and it shows up as only a single copy. I would have expected 3x 12MB of forces transferred in and 3x 12MB of positions transferred out.

@aizvorski
Contributor Author

aizvorski commented Nov 17, 2021

(Also, by the way, what are sortBuckets and findBlocksWithInteractions? They take a lot of time, nearly half the total, and probably only run on GPU 0.)

@peastman
Member

Thanks! I'll take a look.

Is this the forces, 1M particles * 3 dimensions * 8 bytes? If so, why double precision?

Forces are represented internally as 64 bit fixed point, regardless of which precision mode you use. That accounts for the size: ~1.07M atoms * 3 components * 8 bytes ≈ 25.6 MB.

On the other hand, why is the amount of data being transferred out only 17MB?

I don't know. The code performs a separate cuMemcpyAsync() for each one, but possibly the driver is smart enough to realize they're all copying the same data and they can be combined.

(Also, by the way, what are sortBuckets and findBlocksWithInteractions?

That's part of building the neighbor list.

@peastman
Member

There's a lot of great information in there. Here's the portion of the timeline for a single time step from the v100-4x-um file. (What is "um"? Unified memory?) Anyway, there doesn't seem to be any significant difference between the two.

[Screenshot: Nsight Systems timeline for a single time step]

There's a lot to discuss. Let's start at the 254 ms mark. The first GPU does three things on three different streams.

  • Stream 7 computes bonded and nonbonded interactions.
  • Stream 63 computes the PME reciprocal space interactions.
  • Stream 62 performs three memcpy operations to copy the positions to the other three GPUs.

Notice that the copies are performed in sequence. If we could do them in parallel, that would save a little bit of time.

It looks like sortBuckets() is really slow, but actually that's just because it's having to compete with PME. gridSpreadCharge() floods the bus with write operations, which makes anything else trying to access memory at the same time much slower. On the other GPUs, sortBuckets() is much faster.

findBlocksWithInteractions() is expensive. If you look through the profile, however, you'll notice that's only true on roughly half the steps. If no atom has moved further than a threshold distance, it can return immediately. Sometimes it's expensive, and sometimes it takes hardly any time at all.

findBlocksWithInteractions() and computeNonbonded() both get parallelized across GPUs. The more GPUs you have, the less time they'll take on each one.

Once all the computation is done, the forces get copied back to the first GPU. Here again the three copies get performed in series. Not only that, but they don't even start until the first GPU is done with its computation, even though the other GPUs finish earlier and the bus is free.

Finally the first GPU performs integration, and then the next step begins.

Here are the approximate timings for the different parts of the step, as seen by GPUs 2, 3, and 4.

  • 1.8 ms copying positions. None of them begins computation until all three are done receiving positions.
  • 5 ms computing forces.
  • 1 ms waiting for the first GPU to finish its computation.
  • 2.7 ms copying forces.
  • 0.8 ms waiting for the first GPU to do integration.

Those numbers add up to roughly 11.3 ms, consistent with the ~11.1 ms per step measured above.

It definitely looks like there's some performance to be gained by optimizing the communication. I don't know whether NCCL is the best way to do it. I try to keep dependencies to an absolute minimum, and NCCL appears to be Linux-only and to require you to log in with a developer ID to download it. Both of those are major strikes against it. Anyway, we can look into different options.

@aizvorski
Contributor Author

aizvorski commented Nov 18, 2021

@peastman Wow! Very nice. Indeed, it looks like the copies are always sequential.

I figured out how to check for NVLink, and yes, it was being used on the V100 system. nvidia-smi nvlink -gt d shows the total data transfer counters, and nvidia-smi nvlink -s shows the status. Example outputs here: nvlink-test.txt

Each GPU has 6 links, each of which can send and receive at 25GB/s (so 6*2*25 = 300GB/s aggregate, as advertised, but a single point-to-point connection is definitely not 300 GB/s). With 8 GPUs I'm not sure what topology they're connected in, but it seems 0<->3, for example, can use two links, while 0<->1 and 0<->2 use one link each. So we are getting about the expected performance from each link; it's just that we could do more transfers simultaneously.
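(For the topology question, nvidia-smi topo -m should print the GPU-to-GPU connection matrix; entries like NV1 or NV2 mean a pair is connected by one or two NVLinks, while PIX/PHB/SYS indicate PCIe paths.)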

Agreed about adding NCCL as a dependency; it's not necessarily the right thing to do, or at least not the right thing to do first. Making transfers simultaneous probably doesn't need it(?), and would already capture a lot of the potential speedup. NCCL would make topology-dependent transfers easier, e.g. tree broadcasts, so it may be most useful with many GPUs and NVLink but no NVSwitch; with NVSwitch, "just do everything at the same time" is a reasonable approach :) If it helps, NCCL is already used by pytorch and tensorflow; it's BSD licensed, and it's in conda and conda-forge.

@aizvorski
Contributor Author

aizvorski commented Nov 18, 2021

actually that's just because it's having to compete with PME. gridSpreadCharge() floods the bus with write operations, which makes anything else trying to access memory at the same time much slower.

Hmm, if PME is only done on GPU 0, does that mean that with enough GPUs it would make sense to offload everything else (nonbonded, bonded, sortBuckets, findBlocksWithInteractions) entirely to the other GPUs? Or, with fewer GPUs, split the work unevenly, so that GPU 0's PME plus a small slice of other work balances out with a larger slice of other work on each of GPU 1..GPUn?

Also, does that mean that reaction field (RF) would potentially scale much better?

5 ms computing forces.

Is that counting sortBuckets and findBlocksWithInteractions? It seems the actual computeBondedForces and computeNonbonded take much less than that, closer to 2ms.

findBlocksWithInteractions() is expensive. If you look through the profile, however, you'll notice that's only true on roughly half the steps. If no atom has moved further than a threshold distance, it can return immediately.

Maybe off-topic for this thread, but is there an explanation of how sortBuckets and findBlocksWithInteractions work? Since they are rather expensive, I'm wondering if there is a way to avoid recalculating the blocks as often (naively speaking, just add new blocks for atoms that have moved close to each other, and only rarely recalculate all blocks to improve block utilization). Is there some kind of spatial index structure like a quadtree?

@peastman
Member

@dmclark17 might have some insight about this. Do you have any suggestions about ways of optimizing the communication?

Hmm, if PME is only done on GPU0, does that mean with enough GPUs it would make sense to offload everything else (nonbonded, bonded, sortBuckets, findBlocksWithInteractions) entirely to the other GPUs?

It does that automatically. During the first 200 steps it monitors how long each GPU takes to complete and dynamically load balances them by redistributing direct space nonbonded work:

// Balance work between the contexts by transferring a little nonbonded work from the context that
// finished last to the one that finished first.
if (cu.getComputeForceCount() < 200) {
    int firstIndex = 0, lastIndex = 0;
    for (int i = 0; i < (int) completionTimes.size(); i++) {
        if (completionTimes[i] < completionTimes[firstIndex])
            firstIndex = i;
        if (completionTimes[i] > completionTimes[lastIndex])
            lastIndex = i;
    }
    double fractionToTransfer = min(0.01, contextNonbondedFractions[lastIndex]);
    contextNonbondedFractions[firstIndex] += fractionToTransfer;
    contextNonbondedFractions[lastIndex] -= fractionToTransfer;
    double startFraction = 0.0;
    for (int i = 0; i < (int) contextNonbondedFractions.size(); i++) {
        double endFraction = startFraction+contextNonbondedFractions[i];
        if (i == contextNonbondedFractions.size()-1)
            endFraction = 1.0; // Avoid roundoff error
        data.contexts[i]->getNonbondedUtilities().setAtomBlockRange(startFraction, endFraction);
        startFraction = endFraction;
    }
}

@aizvorski
Contributor Author

It does that automatically. During the first 200 steps it monitors how long each GPU takes to complete and dynamically load balances them by redistributing direct space nonbonded work

Very nice!

I guess the most ideal version of that would redistribute things so that GPU 0's force calculation (if any) finishes at about the same time as the copying of forces from the other GPUs. GPU 0 can still be running computeNonbonded while all the Memcpy P2P is happening (it looks like it doesn't now).

Actually, all the other GPUs can start their transfers as soon as they finish calculating forces; I think there is no need to synchronize the start of those transfers with anything, they just need to complete before sumForces starts.

@aizvorski
Contributor Author

@peastman I just wanted to say this is amazing, it really gives me an appreciation of how much openmm does under the hood. Thank you again. I'll try to get more benchmarks tomorrow - no luck on A100s yet :)

Also so this doesn't get lost -

What is "um"? Unified memory?

Yes, those traces were collected with --cuda-um-gpu-page-faults=true. I wasn't sure how much overhead that would add (the docs warn about it), so I made sure to run everything once without that, and only added it in a few runs.

@aizvorski changed the title from "Use NCCL for multi-GPU" to "Optimize multi-GPU communication" on Nov 18, 2021
@dmclark17
Contributor

Just to clarify, I don't think OpenMM uses Unified Memory. It is taking advantage of the Unified Virtual Address Space for the peer-to-peer memcpy operations though.

I think we should be able to parallelize the memory operations by placing them in separate streams—at least for the position broadcast, it seems doable.

It doesn't seem like the force reduction memcpy operations are launched from the CudaParallelKernels.cpp file, or I am just missing them. It would be great if we could overlap the nonbonded kernel and the communication here—somehow stream the forces back as they become ready.

@peastman
Member

Here is where it does the position copies:

cuEventRecord(event, cu.getCurrentStream());
cuStreamWaitEvent(peerCopyStream, event, 0);
for (int i = 1; i < (int) data.contexts.size(); i++)
    CHECK_RESULT(cuMemcpyAsync(data.contexts[i]->getPosq().getDevicePointer(), cu.getPosq().getDevicePointer(), numBytes, peerCopyStream), "Error copying positions");
cuEventRecord(event, peerCopyStream);

So all we would need to do is specify a different stream for each one instead of a single peerCopyStream for all of them? That also explains why none of the other GPUs starts until all copies are finished: because we record a single event that all of them wait on. So we should likewise have a separate event for each one.
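Something along these lines, perhaps (just a sketch; peerCopyStreams and peerCopyEvents would be new per-context arrays):

cuEventRecord(event, cu.getCurrentStream());
for (int i = 1; i < (int) data.contexts.size(); i++) {
    // Give each copy its own stream so the transfers can run concurrently.
    cuStreamWaitEvent(peerCopyStreams[i], event, 0);
    CHECK_RESULT(cuMemcpyAsync(data.contexts[i]->getPosq().getDevicePointer(),
            cu.getPosq().getDevicePointer(), numBytes, peerCopyStreams[i]), "Error copying positions");
    // Record a per-destination event so context i can start computing as soon
    // as its own copy has arrived, instead of waiting for all of them.
    cuEventRecord(peerCopyEvents[i], peerCopyStreams[i]);
}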

Here is where we copy the forces back:

CHECK_RESULT(cuMemcpy(contextForces.getDevicePointer()+offset, cu.getForce().getDevicePointer(), numBytes), "Error copying forces");

That is executed on a separate thread for each GPU. I'm not sure why the copies don't start until the first GPU has finished its work, or why they don't run in parallel. Perhaps we need to use cuMemcpyAsync() instead of cuMemcpy() so we can specify streams?
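If so, each context's thread could issue something like this as soon as its force kernels finish (again just a sketch; forcesCopiedEvent is hypothetical):

// Asynchronous copy on this context's own stream, so the transfers from
// different GPUs can overlap with each other and with GPU 0's remaining work.
CHECK_RESULT(cuMemcpyAsync(contextForces.getDevicePointer()+offset,
        cu.getForce().getDevicePointer(), numBytes, cu.getCurrentStream()), "Error copying forces");
cuEventRecord(forcesCopiedEvent, cu.getCurrentStream());
// GPU 0 would then cuStreamWaitEvent() on each context's forcesCopiedEvent
// before launching sumForces, instead of blocking in a synchronous cuMemcpy().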

@peastman
Member

peastman commented Dec 3, 2021

I think we can close this. #3347 made the communication a lot more efficient. I think there's still room to further improve multi-GPU performance, but that will come from optimizing other aspects.
