[Multi-GPU] llm.c now runs on multiple GPUs with NCCL #248
Conversation
Very exciting! Looking forward to stepping through this in detail tomorrow.
```c
#ifdef MULTI_GPU
// Average all gradients.
char* grads_memory_iterator = (char*)model->grads_memory;
for (int i = 0; i < NUM_PARAMETER_TENSORS; ++i) {
```
Ugh, really sad to have this loop here :( With @ngc92's changes this will be just two calls, but ideally it would be 1.
I agree it is ugly. Looking forward to the rearrangements.
I think we would need at least 2 calls, as long as we have both 16- and 32-bit gradients.
Sadly I'm not able to test these changes yet because Lambda brought down my box during a re-image :( I expect I'll get it back later today and will run it and take a closer look.
640K tokens/s on 8×A100 with -np 8 -b 3 (BF16 mode)! I couldn't get higher batch sizes to work at -np 8, which I assume is related to the existing batch-size bug rather than an issue with this PR. Perf drops from ~86K tokens/s (NO_MULTI_GPU) to ~82K (1 GPU with MULTI_GPU) to ~80K per GPU when going to 8× multi-GPU, which seems really good for a first implementation! We'd probably want to avoid the multi-GPU overhead when there's only 1 GPU though, even if the build happens to include NCCL for some reason. I needed to fix 2 small issues to get it working:
@PeterZhizhin - I ran into an interesting problem with the multi-GPU change above in train_gpt2.cu. Designated initializers (which have been in GCC's C dialect for a while now) are not supported in C++ (e.g., CUDA) until C++20. Microsoft's C++ compiler errors out on the code below unless you specify C++20, which is arguably correct. Question: would you be okay if we change this code to plain C-style assignments (see diff below)? This keeps us from forcing the NVCC builds on Windows to C++20. I can do the PR if you're okay with the change. Thanks!
I have tested this on a vast.ai setup with 2 RTX A2000. This shows that the code works, but my setup is not good for profiling, since it doesn't have NVLink.
Here are my results:
On 1 GPU:
On 2 GPUs (steps with the same step number show results from different processes):
It does run faster on equal batch sizes (~190ms vs ~240ms).
However, what's more important is that it now runs on larger batch sizes: