Using cublasGemmBatchedEx/cublasGemmStridedBatchedEx for training #4731
Conversation
To avoid accuracy loss, the accumulation needs to be done in FP32 for training.
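Roughly, the idea is to route the FP16 batched GEMM through cublasGemmBatchedEx with half-precision inputs/outputs and an FP32 compute type, instead of cublasHgemmBatched, which accumulates in FP16. A minimal sketch (not the exact helper in this PR), assuming the pre-CUDA-11 signature where the compute type is a cudaDataType (CUDA 11+ takes a cublasComputeType_t such as CUBLAS_COMPUTE_32F):

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Sketch: FP16 inputs/outputs, FP32 accumulation.
    // With CUDA_R_32F as the compute type, alpha/beta must point to float.
    inline cublasStatus_t HalfGemmBatchedFp32Accum(cublasHandle_t handle,
                                                   cublasOperation_t transa,
                                                   cublasOperation_t transb,
                                                   int m, int n, int k,
                                                   const float* alpha,
                                                   const __half** Aarray, int lda,
                                                   const __half** Barray, int ldb,
                                                   const float* beta,
                                                   __half** Carray, int ldc,
                                                   int batch_count) {
      return cublasGemmBatchedEx(handle, transa, transb,
                                 m, n, k,
                                 alpha,
                                 reinterpret_cast<const void**>(Aarray), CUDA_R_16F, lda,
                                 reinterpret_cast<const void**>(Barray), CUDA_R_16F, ldb,
                                 beta,
                                 reinterpret_cast<void**>(Carray), CUDA_R_16F, ldc,
                                 batch_count,
                                 CUDA_R_32F,  // accumulate in FP32
                                 CUBLAS_GEMM_DEFAULT);
    }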
    (const __half**)Barray, ldb,
    beta,
    (__half**)Carray, ldc,
    batch_count);
    batch_count);
nit: Maybe explicitly set CUDA_R_16F here to avoid confusion?
Maybe I misunderstood your comment. Were you saying to specify CUDA_R_16F when calling cublasHgemmBatched? It doesn't support setting a data type.
CUDA_R_32F is one case. In the other case, we can also explicitly provide that default argument (I wrongly thought the default argument was CUDA_R_16F, sorry) for clarity.
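For reference (abridged from the cuBLAS documentation, so treat the exact parameter spelling as approximate): cublasHgemmBatched takes no data-type arguments at all, while cublasGemmBatchedEx takes an explicit data type per matrix plus a separate compute type, which is where CUDA_R_32F comes in.

    // cublasHgemmBatched: everything is __half, nothing to set.
    cublasStatus_t cublasHgemmBatched(cublasHandle_t handle,
                                      cublasOperation_t transa, cublasOperation_t transb,
                                      int m, int n, int k,
                                      const __half* alpha,
                                      const __half* const Aarray[], int lda,
                                      const __half* const Barray[], int ldb,
                                      const __half* beta,
                                      __half* const Carray[], int ldc,
                                      int batchCount);

    // cublasGemmBatchedEx: per-matrix data types (CUDA_R_16F here) plus a compute type
    // (CUDA_R_32F for FP32 accumulation; a cublasComputeType_t on CUDA 11+).
    cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle,
                                       cublasOperation_t transa, cublasOperation_t transb,
                                       int m, int n, int k,
                                       const void* alpha,
                                       const void* const Aarray[], cudaDataType Atype, int lda,
                                       const void* const Barray[], cudaDataType Btype, int ldb,
                                       const void* beta,
                                       void* const Carray[], cudaDataType Ctype, int ldc,
                                       int batchCount,
                                       cudaDataType computeType,
                                       cublasGemmAlgo_t algo);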
    inline cublasStatus_t cublasGemmBatchedHelper(cublasHandle_t handle,
                                                  cublasOperation_t transa,
                                                  cublasOperation_t transb,
                                                  int m, int n, int k,
    int m, int n, int k,
nit: some int parameters can be const.
What I learned is that const on built-in types in pass-by-value function parameters is not necessary. See this link: https://abseil.io/tips/109
Nice tip! To quote from that link:
"Do use top-level const on function parameters in definitions at your (or your team’s) discretion. You might follow the same rationale as you would for when to declare a function-local variable const."
I think const still has some usefulness here, similar to const local variables.
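For example (a made-up helper just to illustrate the point), top-level const on a by-value parameter only affects the definition and simply documents that the parameter is never reassigned in the body:

    // Hypothetical helper: const here behaves like const on a local variable.
    inline int RoundUp(const int m, const int block) {
      return ((m + block - 1) / block) * block;
    }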
I remember that when I previously added const to int as suggested here, the compiler would complain something like "it is not necessary to use const for built-in types". Moreover, Nvidia doesn't use const for int in these cuBLAS APIs either.
To me, the major benefit of const is readability (which is why I marked my comment as a nit).
    inline cublasStatus_t cublasGemmBatchedHelper(cublasHandle_t handle,
                                                  cublasOperation_t transa,
                                                  cublasOperation_t transb,
                                                  int m, int n, int k,
    int m, int n, int k,
Similar to other places, some parameters can be const.
wschin left a comment
LGTM. For training, accumulation should be in FP32.
    transb,
    m, n, k,
    &h_a,
    (const void**)Aarray, CUDA_R_16F, lda,
Avoid C-style casts, per https://google.github.io/styleguide/cppguide.html#Casting
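A minimal sketch of the suggested spelling for that cast (wrapped in a hypothetical helper just to make it self-contained):

    #include <cuda_fp16.h>

    // Same conversion as the C-style cast "(const void**)Aarray" in the diff,
    // written with a named cast as the style guide prefers.
    inline const void** AsVoidPtrArray(const __half** Aarray) {
      return reinterpret_cast<const void**>(Aarray);
    }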
Please run the E2E test and the "Pytorch Frontend E2E" test; I'm not sure whether the BatchMatMul result is different enough to affect the expected test values.
For a 1P training task, we found an accuracy issue on V100. It turns out that the accumulation for matmul needs to be done in FP32 for training.
Here is the throughput of BERT-L on V100 16GB with LAMB. As expected, the perf is almost the same.