Optimize multi-GPU communication #3333
Comments
This sounds like a good case for some profiling to identify what the real bottlenecks are. The multi-GPU algorithm is mainly designed for small numbers of GPUs, and is unlikely to ever work well for larger numbers. That would require a completely different architecture. Here's the basic algorithm: the first GPU broadcasts the current positions to all the others, each GPU computes its assigned subset of the forces, the forces are copied back to the first GPU and summed, and then the first GPU performs integration.
You can see several factors that will limit scaling. Transmitting positions and forces takes time. You can't start integration until all GPUs have finished their force computations, so if one takes longer than the others, the rest sit idle. And integration is done on only one GPU, so the others are idle during that time.

I suspect the peer-to-peer copying isn't really the bottleneck. Anyway, we don't want it to be: if the code is working well, the bottleneck will be in computation, not communication, so seeing the data transfer reach only a small fraction of the theoretical limit is a good thing. But a bit of profiling would give more information about where the real bottlenecks are.

There are MD codes designed to run on large clusters and parallelize efficiently over many nodes. They're based on domain decomposition, where each node "owns" the atoms in some small region. Positions and forces only need to be communicated between nearby regions. Each node does integration for the atoms it owns, so that is done in parallel, and it can start as soon as its neighbors have finished their force computations, even if some distant regions aren't finished yet. That's probably the sort of architecture we would need to scale to large numbers of GPUs. But doing that is much more complicated.
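The scaling limits described above can be captured in a toy cost model. This is just an illustration with hypothetical timings, not measured data: the force phase costs as much as the *slowest* GPU, and the broadcast, gather, and integration phases are serial overhead that parallelism never reduces.

```cpp
#include <algorithm>
#include <vector>

// Toy cost model of one MD step under the replicated-data scheme described
// above: broadcast positions, compute forces in parallel on every GPU,
// gather the forces back, integrate on one GPU.  All times in milliseconds.
double stepTime(double broadcast, const std::vector<double>& forceTimes,
                double gather, double integrate) {
    // Integration cannot start until the slowest GPU has finished, so the
    // parallel force phase costs max(forceTimes), not the average.
    double slowest = *std::max_element(forceTimes.begin(), forceTimes.end());
    return broadcast + slowest + gather + integrate;
}
```

For example, with four GPUs taking {4, 4, 4, 5} ms for forces, one straggler makes all four wait: the step costs the same as if every GPU took 5 ms.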
@peastman Thanks! Indeed, profiling: any suggestions on how to do it? I haven't used nvprof/nsight before, but if you tell me how, I would be happy to collect some traces.
Nsight Systems is a good tool for collecting this kind of information. Here is a talk by someone at NVIDIA giving a thorough introduction to using it. There's a lot of information it can collect. You can see, for example, what each GPU is doing, when it's idle, when data is being transferred between GPUs, how long it takes, etc.
@peastman First round of profiling done! Traces here: https://gist.github.com/aizvorski/833ca0a9eecd85f929ab968cf4335249 I can open these in nsight-sys and look around, hopefully you can open them as well and they have all the necessary info. These were collected with
or whatever the list of devices should be (ran with 1, 2, 4, and 8). For a few I also added the unified memory profiling option. The workload is always the same: amber20-stmv, ~1M atoms, using the benchmark script. Also openmm=7.6.0, cuda=11.5, nvidia drivers=495.29.05. I ran on 4x A10G GPUs, no NVLink (AWS g5.12xlarge), and 8x V100 with NVLink (AWS p3.16xlarge). AWS has some capacity issues, so no A100s yet, but I'll do those as soon as possible :)
@peastman A representative step on 4x V100, with ~11.1ms per step, just looking at GPU 0:
Copying data takes ~30% of the time in this case. NVLink on this system should give up to 300GB/s peer to peer, so the speeds here look like PCIe speeds, not NVLink speeds. How can I check whether NVLink is actually being used? Is there even a way to turn NVLink off if it is physically there?

Why is the data transferred in 25.6MB blocks? Is this the forces, 1M particles * 3 dimensions * 8 bytes? If so, why double precision? (The benchmark script was in the default single precision mode, not mixed.)

On the other hand, why is the amount of data transferred out only 17MB? That seems low for broadcasting positions to all GPUs, and it shows up as only a single copy. I would have expected 3x 12MB of forces transferred in and 3x 12MB of positions transferred out.
(Also, by the way, what are sortBuckets and findBlocksWithInteractions? They take a lot of time, nearly half the total, and probably only run on GPU0.)
Thanks! I'll take a look.
Forces are represented internally as 64-bit fixed point, regardless of which precision mode you use.
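The 64-bit fixed-point representation explains the observed transfer size. A quick check, assuming the Amber20 STMV benchmark's roughly 1.07 million atoms (the exact count used below, 1,067,095, is the commonly cited figure for that system):

```cpp
#include <cstdint>

// Size of a force buffer stored as 64-bit fixed point, as described above:
// one 8-byte integer per force component, three components per atom.
std::uint64_t forceBufferBytes(std::uint64_t numAtoms) {
    return numAtoms * 3 * sizeof(std::int64_t);
}
```

For 1,067,095 atoms this gives 25,610,280 bytes, i.e. ~25.6 MB, matching the block size seen in the trace, so the 8 bytes per component are the fixed-point accumulators, not double precision.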
I don't know. The code performs a separate
That's part of building the neighbor list.
There's a lot of great information in there. Here's the portion of the timeline for a single time step of the v100-4x-um file. (What is "um"? Unified memory? Anyway, there doesn't seem to be any significant difference between the two.) There's a lot to discuss. Let's start at the 254 ms mark. The first GPU does three things on three different streams.
Notice that the copies are performed in sequence. If we could do them in parallel, that would save a little bit of time. It looks like
Once all the computation is done, the forces get copied back to the first GPU. Here again the three copies get performed in series. Not only that, but they don't even start until the first GPU is done with its computation, even though the other GPUs finish earlier and the bus is free. Finally the first GPU performs integration, and then the next step begins. Here are the approximate timings for the different parts of the step, as seen by GPUs 2, 3, and 4.
It definitely looks like there's some performance to be gained by optimizing the communication. I don't know whether NCCL is the best way to do it. I try to keep dependencies to an absolute minimum, and NCCL appears to be Linux-only and to require you to log in with a developer ID to download it. Both of those are major strikes against it. Anyway, we can look into different options.
@peastman Wow! Very nice. Indeed, it looks like the copies are always sequential. I figured out how to check for NVLink, and yes, it was being used on the V100 system. Each GPU has 6 links, each of which can send and receive at 25GB/s (so up to 300GB/s in total).

Agreed about adding NCCL as a dependency; it's not necessarily the right thing to do, or at least not the right thing to do first. Making transfers simultaneous probably doesn't need it(?), and would already get a lot of the potential speedup. It seems NCCL would make topology-dependent transfers easier, e.g. doing tree broadcasts; so it is probably more useful in cases with lots of GPUs and NVLink but without NVSwitch, because with NVSwitch "just do everything at the same time" is a reasonable approach :) If it helps, NCCL is already used by pytorch and tensorflow; it's BSD-licensed, and it's in conda and conda-forge.
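The NVLink numbers above are easy to sanity-check. A small helper for the arithmetic, plus one to see what fraction of peak an observed transfer achieved (the 2 ms figure in the test is a hypothetical example, not a measurement from the traces):

```cpp
// V100 NVLink arithmetic: each GPU has some number of links, each moving a
// fixed rate per direction; peak one-way bandwidth is their product.
double peakGBps(int links, double perLinkGBps) {
    return links * perLinkGBps;
}

// Fraction of that peak achieved by moving `bytes` bytes in `seconds`.
double fractionOfPeak(double bytes, double seconds, double peak) {
    return (bytes / 1e9) / seconds / peak;
}
```

With 6 links at 25GB/s, the peak is 150GB/s each way; a hypothetical 25.6MB copy taking 2 ms would be running at ~12.8GB/s, under 9% of that peak.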
Hmm, if PME is only done on GPU0, does that mean that with enough GPUs it would make sense to offload everything else (nonbonded, bonded, sortBuckets, findBlocksWithInteractions) entirely to the other GPUs? Or, with fewer GPUs, split the work unevenly, so that GPU0's PME plus a small slice of other work balances out with larger slices on GPU1..GPUn? Also, does that mean reaction field (RF) would potentially scale much better?
Is that counting sortBuckets and findBlocksWithInteractions? It seems the actual computeBondedForces and computeNonbonded take much less than that, closer to 2ms.
Maybe off-topic for this thread, but is there an explanation of how sortBuckets and findBlocksWithInteractions work? Since they are rather expensive, I'm wondering if there is a way to avoid recalculating the blocks as often (naively speaking, just add new blocks for atoms that have moved close to each other, and only rarely recalculate all blocks to improve block utilization). Is there some kind of spatial index structure, like a quadtree?
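For context on the kind of data structure involved: neighbor-list builders typically use spatial binning rather than a quadtree. The sketch below is a generic illustration of that idea, not OpenMM's actual sortBuckets implementation:

```cpp
#include <cmath>
#include <map>
#include <tuple>
#include <vector>

struct Vec3 { double x, y, z; };

// Generic spatial binning of the kind neighbor-list builders use: hash each
// atom into a cubic cell with edge >= the cutoff distance, so an atom's
// neighbors can only lie in its own cell or the 26 adjacent ones.  This is
// only an illustration of the idea, not OpenMM's actual implementation.
std::map<std::tuple<int, int, int>, std::vector<int>>
binAtoms(const std::vector<Vec3>& pos, double cellSize) {
    std::map<std::tuple<int, int, int>, std::vector<int>> cells;
    for (int i = 0; i < (int) pos.size(); i++) {
        std::tuple<int, int, int> key(
            (int) std::floor(pos[i].x / cellSize),
            (int) std::floor(pos[i].y / cellSize),
            (int) std::floor(pos[i].z / cellSize));
        cells[key].push_back(i);
    }
    return cells;
}
```

Because atoms move only a little per step, one could in principle rebuild this lazily, which is essentially the incremental-update idea raised above.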
@dmclark17 might have some insight about this. Do you have any suggestions about ways of optimizing the communication?
It does that automatically. During the first 200 steps it monitors how long each GPU takes to complete and dynamically load balances them by redistributing direct space nonbonded work: openmm/platforms/cuda/src/CudaParallelKernels.cpp Lines 232 to 254 in f477b10
Very nice! I guess the ideal version of that would redistribute things so that the GPU0 force calculation (if any) finishes at about the same time as the copying of forces from the other GPUs. GPU0 can still be running computeNonbonded while all the P2P memcpys are happening (it looks like it doesn't now). Actually, all the other GPUs can start their transfers as soon as they finish calculating forces; I think there is no need to synchronize the starting points of those transfers with anything, they just need to complete before sumForces starts.
@peastman I just wanted to say this is amazing, it really gives me an appreciation of how much openmm does under the hood. Thank you again. I'll try to get more benchmarks tomorrow - no luck on A100s yet :) Also so this doesn't get lost -
Yes, those traces were collected with
Just to clarify, I don't think OpenMM uses Unified Memory. It is taking advantage of the Unified Virtual Address Space for the peer-to-peer memcpy operations, though. I think we should be able to parallelize the memory operations by placing them in separate streams; at least for the position broadcast, it seems doable. It doesn't seem like the force reduction memcpy operations are launched from the
Here is where it does the position copies: openmm/platforms/cuda/src/CudaParallelKernels.cpp Lines 196 to 200 in f477b10
So all we would need to do is specify a different stream for each one instead of a single stream.

Here is where we copy the forces back:
That is executed on a separate thread for each GPU. I'm not sure why they don't start until the first GPU has finished its work, or why they don't run in parallel. Perhaps we need to use
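One possible shape for the fix, sketched below with hypothetical names (`forcesDone`, `copyDone`, the buffer arrays) and under the assumption that per-GPU streams and events are available. This is not the actual OpenMM code, just an illustration of the event/stream wiring being discussed:

```cuda
// Hypothetical sketch: issue each peer-to-peer force copy on its own stream
// as soon as the producing GPU's force kernel has finished, then make the
// summation on the first GPU wait on all the copies.
void gatherForces(int numGpus, const int* deviceIds,
                  void* const* srcForces, void* const* dstForces, size_t bytes,
                  cudaStream_t* copyStreams, cudaEvent_t* forcesDone,
                  cudaEvent_t* copyDone, cudaStream_t sumStream) {
    for (int gpu = 1; gpu < numGpus; gpu++) {
        // Start this copy as soon as GPU 'gpu' has finished its own forces;
        // no need to wait for GPU 0 or for the other copies.
        cudaStreamWaitEvent(copyStreams[gpu], forcesDone[gpu], 0);
        cudaMemcpyPeerAsync(dstForces[gpu], deviceIds[0],
                            srcForces[gpu], deviceIds[gpu],
                            bytes, copyStreams[gpu]);
        cudaEventRecord(copyDone[gpu], copyStreams[gpu]);
        // The force summation waits on every copy, but the copies
        // themselves all run concurrently on their own streams.
        cudaStreamWaitEvent(sumStream, copyDone[gpu], 0);
    }
}
```

This matches the observation above: the copies have no ordering dependency on each other, only a completion dependency before sumForces.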
I think we can close this. #3347 made the communication a lot more efficient. I think there's still room to further improve multi-GPU performance, but that will come from optimizing other aspects.
When using multi-GPU, positions get copied from one GPU to all GPUs, and after the force kernels run, forces get copied back and summed on one GPU. This seems like a perfect use case for the very fast ncclBroadcast() and ncclReduce() calls in the NCCL library.
Looking at the code, it seems most of the work now happens here:
openmm/platforms/cuda/src/CudaParallelKernels.cpp
Lines 188 to 201 in f477b10
and
openmm/platforms/cuda/src/CudaParallelKernels.cpp
Lines 221 to 230 in f477b10
It seems there are two code paths depending on whether peerAccessSupported is true or false. If it's false, then there is a call to CudaArray.download() for the inputs and then CudaArray.upload() for the outputs. If it's true, then the data is accessed using unified addressing from one GPU to another (without an explicit copy). I think on most recent GPUs the peerAccessSupported=true path would be used (both on systems with NVLink/NVSwitch and on plain PCIe). I'm not sure of the performance of using unified memory in this way, but it's probably not optimal (especially without prefetch). Empirically, the multi-GPU code now only reaches ~41 GB/s on an 8x A100 system, vs the 8x 600 GB/s hardware limit; there's only a 26% speedup going from 2x to 4x GPUs, and no speedup at all going from 4x to 8x.

The NCCL calls use more sophisticated tree or ring routing algorithms which are aware of the topology of the links (not equivalent to n sequential copies), and can also sum during the copy when doing a reduce. This can readily reach near-hardware-limited performance for operations on data >= a few megabytes. So, I would like to request adding NCCL support to the multi-GPU code, and/or possibly switching over entirely.