
Fix loss scaling when running ORTTrainer with BERT under mixed-precision mode#6932

Merged
wschin merged 2 commits into master from wechi/fix-loss-scale-master
Mar 8, 2021

Conversation

@wschin
Contributor

@wschin wschin commented Mar 8, 2021

Recent experiments reveal a divergence problem introduced by the pipeline-parallel PR. As shown in the figure below, the green line (with PP) differs from the gray line (without PP). With this PR, the gray line becomes the orange curve and overlaps with the correct green line.
[Figure: training loss curves — green (with PP), gray (without PP), orange (with this PR)]
The script used is from ORT's BERT example (based on NV-BERT). To speed up the experiment, we use a very small model, but it is enough to detect the mixed-precision problem within one minute. Here is the configuration we use:

#bert_config.json

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 32,
  "initializer_range": 0.02,
  "intermediate_size": 32,
  "max_position_embeddings": 512,
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
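The bug concerns loss scaling under mixed precision. As a rough illustration of the mechanism involved (this is a hypothetical sketch in plain Python, not ORTTrainer's actual API — every name below is illustrative), a dynamic loss scaler multiplies the loss before the backward pass and backs off the scale whenever the scaled gradients overflow:

```python
# Hypothetical sketch of dynamic loss scaling; names are illustrative,
# not ORTTrainer's actual API.
class DynamicLossScaler:
    """Scale the loss up before backward; skip the optimizer step and
    shrink the scale when scaled gradients overflow (inf/NaN)."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Amplify the loss so small fp16 gradients don't underflow to zero.
        return loss * self.scale

    def unscale(self, scaled_grad):
        # Divide the scale back out before the optimizer update.
        return scaled_grad / self.scale

    def update(self, found_overflow):
        # Returns True if the optimizer step should run this iteration.
        if found_overflow:
            # Overflow: discard this step's gradients and back off the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=8.0)
scaled = scaler.scale_loss(1.5)  # loss amplified by the current scale
```

If the scale is applied or unscaled inconsistently across pipeline stages, the effective gradients differ from the single-device run, which is the kind of divergence this PR addresses.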

and

# part of run_pretraining_ort.sh
...

precision=${3:-"fp16"}
num_gpus=${4:-8}
gpu_memory_limit_gb=${26:-"32"}

seed=${12:-42}
job_name=${13:-"bert_lamb_pretraining"}
allreduce_post_accumulation=${14:-"true"}
allreduce_post_accumulation_fp16=${15:-"true"}

resume_training=${8:-"false"}
create_logfile=${9:-"true"}
accumulate_gradients=${10:-"true"}
deepspeed_zero_stage=${27:-"false"}

train_batch_size=${1:-128}
learning_rate=${2:-"6e-3"}
warmup_proportion=${5:-"0.2843"}
train_steps=${6:-1000}
save_checkpoint_steps=${7:-200}
gradient_accumulation_steps=${11:-4}

train_batch_size_phase2=${17:-128}
learning_rate_phase2=${18:-"4e-3"}
warmup_proportion_phase2=${19:-"0.128"}
train_steps_phase2=${20:-1}
gradient_accumulation_steps_phase2=${11:-2}

...
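The script above reads its configuration from shell positional parameters with fallbacks (`${N:-default}`). A minimal, self-contained illustration of that pattern, using the same variable names and defaults shown above:

```shell
#!/bin/sh
# Illustration of the ${N:-default} expansion used in run_pretraining_ort.sh:
# parameter N is used when set, otherwise the default after :- applies.
set -- 128 6e-3 fp16        # simulate passing args 1-3 to the script
train_batch_size=${1:-128}
learning_rate=${2:-"6e-3"}
precision=${3:-"fp16"}
num_gpus=${4:-8}            # argument 4 was not given, so the default (8) is used
echo "$train_batch_size $learning_rate $precision $num_gpus"
```

Running the snippet prints `128 6e-3 fp16 8`: the first three values come from the simulated arguments, while `num_gpus` falls back to its default.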

@wschin wschin requested a review from a team as a code owner March 8, 2021 08:23
@wschin wschin added training issues related to ONNX Runtime training; typically submitted using template type:bug labels Mar 8, 2021
@wschin wschin merged commit de6e66f into master Mar 8, 2021
@wschin wschin deleted the wechi/fix-loss-scale-master branch March 8, 2021 13:12
@wschin
Contributor Author

wschin commented Mar 8, 2021

This PR replaces #6929.

