
Fix loss scaling when running ORTTrainer with BERT under mixed-precision mode#6932

Merged
wschin merged 2 commits into master from wechi/fix-loss-scale-master
Mar 8, 2021

Conversation

@wschin
Contributor

@wschin wschin commented Mar 8, 2021

Recent experiments reveal a divergence problem introduced by the pipeline-parallel PR. As shown in the figure below, the green line (with PP) differs from the gray line (without PP). With this PR, the gray line becomes the orange curve and overlaps with the correct green line.
[Figure: training loss curves — green (with PP), gray (without PP), orange (with this PR)]
The script used is from ORT's BERT example (based on NV-BERT). To speed up the experiment, we use a very small model, but it is enough to detect the mixed-precision problem within one minute. Here is the configuration we use:

#bert_config.json

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 32,
  "initializer_range": 0.02,
  "intermediate_size": 32,
  "max_position_embeddings": 512,
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
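The bug concerns loss scaling under mixed precision. As a rough illustration of the mechanism involved (this is a hypothetical sketch in plain Python, not ORTTrainer's actual API — every name below is illustrative), a dynamic loss scaler multiplies the loss before the backward pass and backs off the scale whenever the scaled gradients overflow:

```python
# Hypothetical sketch of dynamic loss scaling; names are illustrative,
# not ORTTrainer's actual API.
class DynamicLossScaler:
    """Scale the loss up before backward; skip the optimizer step and
    shrink the scale when scaled gradients overflow (inf/NaN)."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Amplify the loss so small fp16 gradients don't underflow to zero.
        return loss * self.scale

    def unscale(self, scaled_grad):
        # Divide the scale back out before the optimizer update.
        return scaled_grad / self.scale

    def update(self, found_overflow):
        # Returns True if the optimizer step should run this iteration.
        if found_overflow:
            # Overflow: discard this step's gradients and back off the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=8.0)
scaled = scaler.scale_loss(1.5)  # loss amplified by the current scale
```

If the scale is applied or unscaled inconsistently across pipeline stages, the effective gradients differ from the single-device run, which is the kind of divergence this PR addresses.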

and

# part of run_pretraining_ort.sh
...

precision=${3:-"fp16"}
num_gpus=${4:-8}
gpu_memory_limit_gb=${26:-"32"}

seed=${12:-42}
job_name=${13:-"bert_lamb_pretraining"}
allreduce_post_accumulation=${14:-"true"}
allreduce_post_accumulation_fp16=${15:-"true"}

resume_training=${8:-"false"}
create_logfile=${9:-"true"}
accumulate_gradients=${10:-"true"}
deepspeed_zero_stage=${27:-"false"}

train_batch_size=${1:-128}
learning_rate=${2:-"6e-3"}
warmup_proportion=${5:-"0.2843"}
train_steps=${6:-1000}
save_checkpoint_steps=${7:-200}
gradient_accumulation_steps=${11:-4}

train_batch_size_phase2=${17:-128}
learning_rate_phase2=${18:-"4e-3"}
warmup_proportion_phase2=${19:-"0.128"}
train_steps_phase2=${20:-1}
gradient_accumulation_steps_phase2=${11:-2}

...
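The script above reads its configuration from shell positional parameters with fallbacks (`${N:-default}`). A minimal, self-contained illustration of that pattern, using the same variable names and defaults shown above:

```shell
#!/bin/sh
# Illustration of the ${N:-default} expansion used in run_pretraining_ort.sh:
# parameter N is used when set, otherwise the default after :- applies.
set -- 128 6e-3 fp16        # simulate passing args 1-3 to the script
train_batch_size=${1:-128}
learning_rate=${2:-"6e-3"}
precision=${3:-"fp16"}
num_gpus=${4:-8}            # argument 4 was not given, so the default (8) is used
echo "$train_batch_size $learning_rate $precision $num_gpus"
```

Running the snippet prints `128 6e-3 fp16 8`: the first three values come from the simulated arguments, while `num_gpus` falls back to its default.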

@wschin wschin requested a review from a team as a code owner March 8, 2021 08:23
@wschin wschin added training issues related to ONNX Runtime training; typically submitted using template type:bug labels Mar 8, 2021
@wschin wschin merged commit de6e66f into master Mar 8, 2021
@wschin wschin deleted the wechi/fix-loss-scale-master branch March 8, 2021 13:12
@wschin
Contributor Author

wschin commented Mar 8, 2021

This PR replaces #6929.

