Overlap gradient computation and NCCL AllReduce #361
@@ -2348,7 +2410,7 @@ void common_free(GPT2 &model) {
     cudaCheck(cudaFree(cublaslt_workspace));
     cublasCheck(cublasDestroy(cublas_handle));
     cublasCheck(cublasLtDestroy(cublaslt_handle));
-    create_cudnn();
+    destroy_cudnn();
@karpathy @PeterZhizhin please cherry-pick this; it should be merged immediately
     printf0("step %4d/%d: train loss %f (acc %f) (%f ms, %0f tok/s)\n",
             step + 1, train_num_batches, model.mean_loss, accumulated_loss,
             time_elapsed_ms, bias_corrected_ema_tokens_per_second);
     logger_log_train(&logger, step, model.mean_loss);

     // disable the profiler after 3 steps of optimization
-    if (step == 3) { cudaProfilerStop(); }
+    if (step == 3) { cudaCheck(cudaProfilerStop()); }
this is an independent fix too
    // Aggregate grads.lnfw and grads.lnfb in a background stream
    floatX* layernorm_backward_pointers[] = {grads.lnfw, grads.lnfb};
    size_t layernorm_backward_sizes[] = {C, C};
    multi_gpu_async_all_reduce_pointers_group(2, layernorm_backward_pointers, layernorm_backward_sizes, multi_gpu_config, main_stream);
comment says background stream, but call uses main_stream?
oh wait, in the version this code was based on, main_stream was the background stream?
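For anyone following the stream question above, here is a rough host-only sketch of why it matters which stream the all-reduce is queued on. It is not code from this PR: `StepTimes`, `model_backward`, and all the timings are hypothetical, and it only models two in-order queues (a compute stream and a communication stream) where a layer's reduction can start once its gradients exist and the communication stream is idle.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical result type: modelled time for one backward pass.
struct StepTimes {
    double serial_ms;      // every all-reduce blocks the next layer's backward
    double overlapped_ms;  // all-reduces run on a separate, in-order stream
};

// Timeline model (an assumption for illustration, not the PR's code).
// backward_ms[i] and allreduce_ms[i] are per-layer costs, listed in the
// order the backward pass visits the layers; both vectors must have the
// same length.
StepTimes model_backward(const std::vector<double>& backward_ms,
                         const std::vector<double>& allreduce_ms) {
    double serial = 0.0;
    double compute_done = 0.0;  // when the compute stream finishes the current layer
    double comm_done = 0.0;     // when the communication stream becomes idle
    for (std::size_t i = 0; i < backward_ms.size(); ++i) {
        serial += backward_ms[i] + allreduce_ms[i];
        compute_done += backward_ms[i];
        // The reduction waits for this layer's gradients and for any
        // earlier reduction still in flight on the communication stream.
        comm_done = std::max(comm_done, compute_done) + allreduce_ms[i];
    }
    return {serial, std::max(compute_done, comm_done)};
}
```

With made-up costs of 10 ms backward and 4 ms all-reduce for each of three layers, `model_backward({10, 10, 10}, {4, 4, 4})` gives 42 ms serial and 34 ms overlapped: only the last layer's reduction stays exposed. If the reduction is queued on the same stream as the compute kernels, the model collapses to the serial case, which is why the comment/argument mismatch is worth double-checking.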
On my setup, I get the following:
Before:
After:
So, a 12% speedup.
Nsight Systems profiles:
Before:
After: