BiasDropoutFusion by SherlockNoMad · Pull Request #4167 · microsoft/onnxruntime

SherlockNoMad · 2020-06-09T05:44:56Z

Fuse Add + Dropout + Add into a single BiasDropout op.
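For readers unfamiliar with the pattern, here is a minimal CPU sketch of what the fused computation amounts to, assuming the semantics are `dropout(x + bias) + residual` with the residual being optional. The function name `BiasDropoutReference`, its signature, and the use of `std::mt19937` are illustrative assumptions; this is not the CUDA kernel from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical CPU reference, for illustration only (not the PR's kernel).
// x:        [rows * hidden] activations
// bias:     [hidden], broadcast across rows (the first Add)
// residual: [rows * hidden], or empty when there is no second Add
// ratio:    drop probability in [0, 1)
// mask:     output, true where the element was kept
std::vector<float> BiasDropoutReference(const std::vector<float>& x,
                                        const std::vector<float>& bias,
                                        const std::vector<float>& residual,
                                        float ratio, std::uint32_t seed,
                                        std::vector<bool>& mask) {
  assert(!bias.empty() && x.size() % bias.size() == 0);
  assert(residual.empty() || residual.size() == x.size());
  assert(ratio >= 0.0f && ratio < 1.0f);

  const float keep = 1.0f - ratio;   // probability of keeping an element
  const float scale = 1.0f / keep;   // rescale kept elements
  std::mt19937 gen(seed);
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);

  std::vector<float> y(x.size());
  mask.assign(x.size(), false);
  for (std::size_t i = 0; i < x.size(); ++i) {
    const bool kept = uniform(gen) < keep;
    mask[i] = kept;
    const float biased = x[i] + bias[i % bias.size()];  // Add (bias)
    float out = kept ? biased * scale : 0.0f;            // Dropout
    if (!residual.empty()) out += residual[i];           // optional Add (residual)
    y[i] = out;
  }
  return y;
}
```

Doing all three steps in one pass means each element is read and written once, instead of once per op as in the unfused graph.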

E2E test passed: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=127697&view=results

Kernel Benchmark

nvprof --print-gpu-summary ./onnxruntime_training_bert --model_name /bert_ort/bert_models/nv/bert-large/bert-large-uncased_L_24_H_1024_A_16_V_30528_S_512_Dp_0.1_optimized_layer_norm --train_data_dir /bert_data/128/books_wiki_en_corpus/train --test_data_dir /bert_data/128/books_wiki_en_corpus/test --train_batch_size 32 --mode train --num_train_steps 100 --display_loss_steps 1 --warmup_ratio=0.2843 --warmup_mode=Poly --optimizer lamb --gradient_accumulation_steps 1 --max_predictions_per_seq=20 --use_nccl  --use_mixed_precision --allreduce_in_fp16

Before
Time(%)  Time      Calls  Name
2.12%    422.14ms  7300   DropoutKernel
1.57%    312.42ms  7300   DropoutGradientKernel

After
Time(%)  Time      Calls  Name
1.43%    280.37ms  7300   DropoutGradientKernel
1.39%    272.51ms  4800   BiasDropoutKernel
1.05%    206.41ms  2500   DropoutKernel
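To compare kernels that are launched a different number of times, it can help to look at the average time per launch. The sketch below recomputes that from the figures quoted above; the struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// One row of the trimmed nvprof summary quoted above.
struct KernelRow {
  const char* name;
  double total_ms;  // total GPU time over the run
  int calls;        // number of launches
};

// Figures copied from the Before/After listings.
const KernelRow kBefore[] = {
    {"DropoutKernel", 422.14, 7300},
    {"DropoutGradientKernel", 312.42, 7300},
};
const KernelRow kAfter[] = {
    {"DropoutGradientKernel", 280.37, 7300},
    {"BiasDropoutKernel", 272.51, 4800},
    {"DropoutKernel", 206.41, 2500},
};

// Average time per launch, in microseconds.
double AvgMicrosPerCall(const KernelRow& row) {
  assert(row.calls > 0);
  return row.total_ms * 1000.0 / row.calls;
}

// Print one line per kernel with its per-launch average.
void PrintAverages(const KernelRow* rows, int count) {
  for (int i = 0; i < count; ++i) {
    std::printf("%-24s %8.2f us/call\n", rows[i].name, AvgMicrosPerCall(rows[i]));
  }
}
```

Calling `PrintAverages(kBefore, 2)` and `PrintAverages(kAfter, 3)` shows roughly 57.8 us per `DropoutKernel` launch before and roughly 56.8 us per `BiasDropoutKernel` launch after. The 4800 fused launches plus the 2500 remaining `DropoutKernel` launches account for the original 7300. Presumably the fused kernel also absorbs the element-wise Add work, which the listing does not show.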

BERT-L run Benchmark

./onnxruntime_training_bert --model_name /bert_ort/bert_models/nv/bert-large/bert-large-uncased_L_24_H_1024_A_16_V_30528_S_512_Dp_0.1_optimized_layer_norm --train_data_dir /bert_data/128/books_wiki_en_corpus/train --test_data_dir /bert_data/128/books_wiki_en_corpus/test --train_batch_size 64 --mode train --num_train_steps 228 --display_loss_steps 1 --warmup_ratio=0.2843 --warmup_mode=Poly --optimizer lamb --gradient_accumulation_steps 1 --max_predictions_per_seq=20 --use_nccl  --use_mixed_precision --allreduce_in_fp16

Before
Stabilized Throughput: 180.876 Examples / Second

After
Stabilized Throughput: 182.82 Examples / Second

Gain: 1.07%

Dropout kernel for residual input
BiasDropout Fusion to take residual input
Fix BiasDropout Kernel
Optimize DropoutGrad with 4 elements per thread

edgchen1 · 2020-06-30T21:26:35Z

      if (li < N) {
        mask_data[li] = (&rand.x)[i] < p;
-        Y_data[li] = X_data[li] * T(mask_data[li]) * scale;
+        Y_data[li] = T(float(X_data[li]) * mask_data[li] * scale);


Should we do the math in float even when T is double?
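To make the concern concrete: routing a `double` input through `float` discards low-order bits before the scale is applied. The sketch below contrasts the always-float form from the diff with one possible alternative that picks the compute type from `T`. Both helper names and the `ComputeType` alias are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical alias: compute in double only when T is double, else in float.
template <typename T>
using ComputeType =
    typename std::conditional<std::is_same<T, double>::value, double, float>::type;

// Mirrors the form in the diff: the math always goes through float.
template <typename T>
T ScaleKeptViaFloat(T x, bool kept, float scale) {
  return static_cast<T>(static_cast<float>(x) * kept * scale);
}

// Alternative sketch: the math is done in ComputeType<T>.
template <typename T>
T ScaleKept(T x, bool kept, float scale) {
  using C = ComputeType<T>;
  return static_cast<T>(static_cast<C>(x) * static_cast<C>(kept) *
                        static_cast<C>(scale));
}
```

With `double x = 1.0 + 1e-10`, `kept = true`, and `scale = 1.0f`, the via-float version returns exactly `1.0`, while the `ComputeType` version returns `x` unchanged. For `float` inputs the two behave the same.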

kkaranasos · 2020-06-30T23:06:21Z

+Fuse Add + Dropout + optional Add to BiasDropoutFusion
+
+*/
+class BiasDropoutFusion : public GraphTransformer {


Why don't we make this a RewriteRule? From a quick skim, the fusion looks quite local, so we could avoid both traversing the whole tree and calling Resolve.
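The locality argument can be illustrated with a toy pattern check that starts from a single node and inspects only its immediate consumers. `ToyNode`, `Match`, and both functions below are hypothetical stand-ins and do not use onnxruntime's graph API. The toy also ignores that Dropout has a second (mask) output.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy graph node; NOT onnxruntime's Node class.
struct ToyNode {
  std::string op_type;
  std::vector<int> consumers;  // indices of nodes that consume this node's output
};

struct Match {
  bool matched = false;
  int bias_add = -1;
  int dropout = -1;
  int residual_add = -1;  // stays -1 when the second Add is absent
};

// Returns the only consumer of `node` if there is exactly one and it has
// the requested op type; otherwise returns -1.
int SoleConsumerOfType(const std::vector<ToyNode>& g, int node,
                       const std::string& op) {
  const ToyNode& n = g[node];
  if (n.consumers.size() != 1) return -1;
  const int c = n.consumers[0];
  return g[c].op_type == op ? c : -1;
}

// Local check for Add -> Dropout -> (optional) Add, starting at `start`.
Match MatchBiasDropout(const std::vector<ToyNode>& g, int start) {
  Match m;
  if (start < 0 || start >= static_cast<int>(g.size())) return m;
  if (g[start].op_type != "Add") return m;
  const int dropout = SoleConsumerOfType(g, start, "Dropout");
  if (dropout < 0) return m;
  m.matched = true;
  m.bias_add = start;
  m.dropout = dropout;
  m.residual_add = SoleConsumerOfType(g, dropout, "Add");
  return m;
}
```

Because the check touches at most three nodes, it could in principle be run per node by a rule-based pass rather than inside a transformer that walks the full graph itself.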

SherlockNoMad requested review from pengwa and weixingzhang June 9, 2020 05:44

SherlockNoMad requested a review from a team as a code owner June 9, 2020 05:44

SherlockNoMad added the training label Jun 10, 2020

Implement BiasDropout Fusion and Kernel

b6da6ac

Dropout kernel for residual input
BiasDropout Fusion to take residual input
Fix BiasDropout Kernel
Optimize DropoutGrad with 4 elements per thread

SherlockNoMad force-pushed the bahuang/bias_dropout branch from 896551e to b6da6ac June 15, 2020 17:18

Add graph transformer UT

aa50c2a

SherlockNoMad changed the title ~~[Draft] BiasDropoutFusion~~ BiasDropoutFusion Jun 16, 2020

SherlockNoMad requested a review from edgchen1 June 16, 2020 07:43