Using cublasGemmBatchedEx/cublasGemmStridedBatchedEx for training #4731
Conversation
To avoid accuracy loss, the accumulation needs to be done in FP32 for training.
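Roughly, the idea is to route the FP16 batched GEMM through cublasGemmBatchedEx with half-precision inputs/outputs and an FP32 compute type, instead of cublasHgemmBatched, which accumulates in FP16. A minimal sketch (not the exact helper in this PR), assuming the pre-CUDA-11 signature where the compute type is a cudaDataType (CUDA 11+ takes a cublasComputeType_t such as CUBLAS_COMPUTE_32F):

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Sketch: FP16 inputs/outputs, FP32 accumulation.
    // With CUDA_R_32F as the compute type, alpha/beta must point to float.
    inline cublasStatus_t HalfGemmBatchedFp32Accum(cublasHandle_t handle,
                                                   cublasOperation_t transa,
                                                   cublasOperation_t transb,
                                                   int m, int n, int k,
                                                   const float* alpha,
                                                   const __half** Aarray, int lda,
                                                   const __half** Barray, int ldb,
                                                   const float* beta,
                                                   __half** Carray, int ldc,
                                                   int batch_count) {
      return cublasGemmBatchedEx(handle, transa, transb,
                                 m, n, k,
                                 alpha,
                                 reinterpret_cast<const void**>(Aarray), CUDA_R_16F, lda,
                                 reinterpret_cast<const void**>(Barray), CUDA_R_16F, ldb,
                                 beta,
                                 reinterpret_cast<void**>(Carray), CUDA_R_16F, ldc,
                                 batch_count,
                                 CUDA_R_32F,  // accumulate in FP32
                                 CUBLAS_GEMM_DEFAULT);
    }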
    (const __half**)Barray, ldb,
    beta,
    (__half**)Carray, ldc,
    batch_count);
    batch_count);
nit: Maybe explicitly set CUDA_R_16F here to avoid confusion?
Maybe I misunderstood your comment. Were you saying to specify CUDA_R_16F when calling cublasHgemmBatched? It doesn't support setting a data type.
CUDA_R_32F is one case. In the other case, we can also explicitly provide that default argument (I wrongly thought the default argument was CUDA_R_16F, sorry) for clarity.
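For reference (abridged from the cuBLAS documentation, so treat the exact parameter spelling as approximate): cublasHgemmBatched takes no data-type arguments at all, while cublasGemmBatchedEx takes an explicit data type per matrix plus a separate compute type, which is where CUDA_R_32F comes in.

    // cublasHgemmBatched: everything is __half, nothing to set.
    cublasStatus_t cublasHgemmBatched(cublasHandle_t handle,
                                      cublasOperation_t transa, cublasOperation_t transb,
                                      int m, int n, int k,
                                      const __half* alpha,
                                      const __half* const Aarray[], int lda,
                                      const __half* const Barray[], int ldb,
                                      const __half* beta,
                                      __half* const Carray[], int ldc,
                                      int batchCount);

    // cublasGemmBatchedEx: per-matrix data types (CUDA_R_16F here) plus a compute type
    // (CUDA_R_32F for FP32 accumulation; a cublasComputeType_t on CUDA 11+).
    cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle,
                                       cublasOperation_t transa, cublasOperation_t transb,
                                       int m, int n, int k,
                                       const void* alpha,
                                       const void* const Aarray[], cudaDataType Atype, int lda,
                                       const void* const Barray[], cudaDataType Btype, int ldb,
                                       const void* beta,
                                       void* const Carray[], cudaDataType Ctype, int ldc,
                                       int batchCount,
                                       cudaDataType computeType,
                                       cublasGemmAlgo_t algo);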
    inline cublasStatus_t cublasGemmBatchedHelper(cublasHandle_t handle,
                                                  cublasOperation_t transa,
                                                  cublasOperation_t transb,
                                                  int m, int n, int k,
    int m, int n, int k,
nit: some int parameters can be const.
What I learned is that const on built-in types in pass-by-value function parameters is not necessary. See this link: https://abseil.io/tips/109
Nice tip! To quote from that link:
"Do use top-level const on function parameters in definitions at your (or your team’s) discretion. You might follow the same rationale as you would for when to declare a function-local variable const."
I think const still has some usefulness here, similar to const local variables.
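For example (a made-up helper just to illustrate the point), top-level const on a by-value parameter only affects the definition and simply documents that the parameter is never reassigned in the body:

    // Hypothetical helper: const here behaves like const on a local variable.
    inline int RoundUp(const int m, const int block) {
      return ((m + block - 1) / block) * block;
    }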
I remember that when I previously added const to int as suggested here, the compiler would complain something like "it is not necessary to use const for built-in types". Moreover, Nvidia doesn't use const for int in these cuBLAS APIs either.
To me, the major benefit of const is readability (which is why I marked my comment as a nit).
    inline cublasStatus_t cublasGemmBatchedHelper(cublasHandle_t handle,
                                                  cublasOperation_t transa,
                                                  cublasOperation_t transb,
                                                  int m, int n, int k,
    int m, int n, int k,
Similar to other places, some parameters can be const.
wschin left a comment
LGTM. For training, accumulation should be in FP32.
    transb,
    m, n, k,
    &h_a,
    (const void**)Aarray, CUDA_R_16F, lda,
Avoid C-style casts, per https://google.github.io/styleguide/cppguide.html#Casting
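A minimal sketch of the suggested spelling for that cast (wrapped in a hypothetical helper just to make it self-contained):

    #include <cuda_fp16.h>

    // Same conversion as the C-style cast "(const void**)Aarray" in the diff,
    // written with a named cast as the style guide prefers.
    inline const void** AsVoidPtrArray(const __half** Aarray) {
      return reinterpret_cast<const void**>(Aarray);
    }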
Please run the E2E test and the "Pytorch Frontend E2E" test; I'm not sure whether the BatchMatMul result is different enough to affect the expected test values.
For a 1P training task, we found an accuracy issue on V100. It turns out that the accumulation for matmul needs to be done in FP32 for training.
Here is the throughput of BERT-L on V100 16GB with LAMB. As expected, the perf is almost the same.