CL script update (microsoft#18)

* cl update
* script update

pengwa · Oct 30, 2021 · commit 54884b5
1 parent db97cd2

Showing 7 changed files with 223 additions and 142 deletions.
diff --git a/examples/curriculum_learning/README.md b/examples/curriculum_learning/README.md
@@ -1,32 +1 @@
-This is a short tutorial of how to use/tune the curriculum learning (CL) integration. Currently it is only integrated for GPT pre-training. For technical details please refer to our [paper](https://arxiv.org/abs/2108.06084).
-
-# Disable batch size warmup (--rampup-batch-size)
-In our [paper](https://arxiv.org/abs/2108.06084) section 5.4 we demonstrate that curriculum learning (seqlen-based) provides much better training stability than the batch size warmup technique. So when using CL you need to remove the `--rampup-batch-size` config in your training script. It's not recommended to use both CL and batch size warmup, because both of them will reduce the number of tokens in a batch. Another related change you might want is to increase your micro batch size, since without batch size warmup your batch size will be fixed now.
-
-# Token-based training termination
-
-Because CL changes the length of each sequence/sample during training, it is very hard, if not impossible, to use the number of steps/samples to terminate training exactly at the desired number of tokens. Thus we add a `--train-tokens` config as an accurate, token-based alternative for termination. We recommend increasing your original `--train-samples` or `--train-iters` to a large enough number (e.g., 2X of what you used for the baseline), and setting `--train-tokens` to the exact desired number of training tokens (e.g., 300B for GPT-3-like training).
-
-# Token-based LR decay
-
-Again, because CL changes the number of tokens per batch, in our [paper](https://arxiv.org/abs/2108.06084) Appendix A.2 we show that it is also necessary to change the LR decay to token-based (to avoid decaying the LR too fast). Thus we add a `--lr-decay-tokens` config, which sets the number of tokens over which the LR decays. If you were previously using `--lr-decay-samples`, you can calculate your `--lr-decay-tokens` simply by multiplying the former by the full seqlen (e.g., 2K for GPT-3). Then replace `--lr-decay-samples` with `--lr-decay-tokens` in your script.
-
-# LR warmup adjustment
-
-We don't change LR warmup to token-based, because doing so for CL would slow down the LR warmup, which is both unnecessary and harmful. However, you may need to adjust your `--lr-warmup-samples` or `--lr-warmup-iters` from the non-CL case for various reasons (e.g., if you used `--rampup-batch-size` in the non-CL case, CL does not use it, so the number of samples per batch will be different at the beginning). Assuming you want to use `X` tokens to warm up the LR (for OpenAI GPT-3 this was 375M tokens), then for the CL case you should set `--lr-warmup-samples` to `X` divided by the `min_difficulty` below, or set `--lr-warmup-iters` to `X` divided by `min_difficulty * --global-batch-size`. This is a rough estimate based on the fact that CL starts from seqlen `min_difficulty` and the seqlen does not increase much during LR warmup.
-
-# Token-based tensorboard
-
-Because of the above changes, we also add token-based tensorboard scalars. We also add scalars that plot the seqlen at each step.
-
-# Curriculum learning hyperparameters tuning strategy
-
-The curriculum learning hyperparameters are all located in the deepspeed config json file (see the example `ds_config_cl.json` in this dir). There are a few config entries that you may need to adjust to your circumstances, two of which require some tuning. Appendix A.1 of our [paper](https://arxiv.org/abs/2108.06084) describes the tuning strategy in more detail.
-
-1. `max_difficulty` should be set as the full seqlen (i.e., your `--seq-length`). No need to tune this.
-
-2. `min_difficulty` is the beginning seqlen used by CL. In general, a smaller `min_difficulty` could provide a better stability/convergence speed benefit. However, we observe that for a larger model or for different training data, starting from a very small seqlen could lead to significant validation PPL fluctuation (or even divergence) at the very beginning. We recommend starting with `min_difficulty` at 64, and then increasing it if you observe problems at the very beginning. Note that to enable Tensor Core acceleration you should always use a multiple of 8.
-
-3. `total_curriculum_step` is the total number of steps used by CL. In general, a larger `total_curriculum_step` could provide a better stability/convergence speed benefit. However, we observe that too large a `total_curriculum_step` could lead to overfitting and significant validation PPL fluctuation (or even divergence) during the first few multiples of the LR warmup steps. In our paper we have a detailed tuning strategy based on binary search. However, if you want to reduce the tuning effort, we recommend directly setting `total_curriculum_step` to half of the baseline's total number of steps. This may not provide the highest convergence speed benefit, but should provide enough training stability gains.
-
-4. `difficulty_step` is the change in seq length per CL step. A smaller value is preferable since it gives a smoother curriculum and better stability. Like `min_difficulty`, it needs to be a multiple of 8 for Tensor Core acceleration, thus 8 is a good default.
+This is an example of how to use DeepSpeed's curriculum learning (CL) feature, which provides faster and more stable language model pre-training. Currently it is only integrated for GPT pre-training. Note that there are two curriculum learning examples in two different repos for Megatron-LM GPT-2 pre-training. Each has some unique features and limitations. See details in our [tutorial](https://www.deepspeed.ai/tutorials/curriculum-learning/). For technical details please refer to our [paper](https://arxiv.org/abs/2108.06084).
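The removed README text gives two conversions in prose: `--lr-decay-tokens` is `--lr-decay-samples` multiplied by the full seqlen, and the LR warmup is `X` tokens divided by `min_difficulty` (samples) or by `min_difficulty * --global-batch-size` (iterations). A minimal sketch of that arithmetic follows. The function names are hypothetical and do not exist in the repo; the 375M warmup tokens and `min_difficulty` of 64 come from the text above, while the sample count and the global batch size of 512 are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of the repo.

# --lr-decay-tokens = --lr-decay-samples * full sequence length
lr_decay_tokens() {
    local decay_samples=$1 seq_len=$2
    echo $(( decay_samples * seq_len ))
}

# --lr-warmup-samples ~= warmup tokens / min_difficulty
lr_warmup_samples() {
    local warmup_tokens=$1 min_difficulty=$2
    echo $(( warmup_tokens / min_difficulty ))
}

# --lr-warmup-iters ~= warmup tokens / (min_difficulty * global batch size)
lr_warmup_iters() {
    local warmup_tokens=$1 min_difficulty=$2 global_batch_size=$3
    echo $(( warmup_tokens / (min_difficulty * global_batch_size) ))
}

lr_decay_tokens 1000000 2048       # assumed sample count, 2K seqlen -> 2048000000
lr_warmup_samples 375000000 64     # -> 5859375
lr_warmup_iters 375000000 64 512   # -> 11444 (integer division rounds down)
```

These are rough estimates in the same sense as the README: they assume the seqlen stays near `min_difficulty` for the whole warmup.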
diff --git a/examples/curriculum_learning/ds_pretrain_gpt2.sh b/examples/curriculum_learning/ds_pretrain_gpt2.sh
@@ -0,0 +1,150 @@
+#! /bin/bash
+
+CONFIG=$1
+TAG=$2
+MODEL_SIZE=$3
+LR=$4
+TOTAL_BATCHSIZE=$5
+SEQ_LEN=$6
+MP_SIZE=$7
+SEED=$8
+SAVE_INTERVAL=$9
+NUM_ITER=${10}
+NUM_TOKEN=${11}
+LR_DECAY_TOKEN=${12}
+LR_WARMUP_ITER=${13}
+CONFIG_TEMPLATE=${14}
+CURRICULUM_STEP=${15}
+CURRICULUM_MIN=${16}
+
+# 12-layer, 768-hidden, 12-heads, 117M parameters
+# 24-layer, 1024-hidden, 16-heads, 345M parameters
+# 36-layer, 1280-hidden, 20-heads, 774M parameters
+# 48-layer, 1600-hidden, 25-heads, 1558M parameters
+if [[ $MODEL_SIZE -eq 117 ]]; then
+        NUM_LAYERS=12
+        HIDDEN_SIZE=768
+        NUM_ATTN_HEADS=12
+elif [[ $MODEL_SIZE -eq 345 ]]; then
+        NUM_LAYERS=24
+        HIDDEN_SIZE=1024
+        NUM_ATTN_HEADS=16
+elif [[ $MODEL_SIZE -eq 774 ]]; then
+        NUM_LAYERS=36
+        HIDDEN_SIZE=1280
+        NUM_ATTN_HEADS=20
+elif [[ $MODEL_SIZE -eq 1558 ]]; then
+        NUM_LAYERS=48
+        HIDDEN_SIZE=1600
+        NUM_ATTN_HEADS=25
+else
+        echo "Model size not supported."
+        exit 1
+fi
+
+# Pipeline parallelism. 1 means no pipelines.
+PP_SIZE=1
+
+# Change for multinode config
+NUM_WORKERS=16
+NUM_GPUS_PER_WORKER=8
+NUM_GPUS=$(( ${NUM_WORKERS} * ${NUM_GPUS_PER_WORKER} ))
+if [[ $PP_SIZE -gt 0 ]]; then
+    DP_SIZE=$(( ${NUM_GPUS} / (${PP_SIZE} * ${MP_SIZE}) ))
+else
+    DP_SIZE=$(( ${NUM_GPUS} / ${MP_SIZE} ))
+fi
+# Batch size per gpu, here we assume grad accumulation step 1
+# you can reduce this if gpu OOM
+BATCHSIZE=$((TOTAL_BATCHSIZE/DP_SIZE))
+
+DATA_PATH=/vc_data/Megatron-LM/data/indexed_datasets/megatron
+VOCAB_PATH=/vc_data/Megatron-LM/data/gpt2-vocab.json
+MERGE_PATH=/vc_data/Megatron-LM/data/gpt2-merges.txt
+
+#ZeRO Configs
+stage=1
+
+current_time=$(date "+%Y.%m.%d-%H.%M.%S")
+script_path=$(realpath "$0")
+script_dir=$(dirname "$script_path")
+host="${HOSTNAME}"
+
+if [ "${CONFIG_TEMPLATE}" = "true" ]; then
+    template_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}.json"
+    config_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}_min${CURRICULUM_MIN}_max${SEQ_LEN}_step${CURRICULUM_STEP}.json"
+    sed "s/CONFIG_CL_MIN/${CURRICULUM_MIN}/" "${template_json}" \
+        | sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
+        | sed "s/CONFIG_CL_DURATION/${CURRICULUM_STEP}/" \
+        > "${config_json}"
+else
+    config_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}.json"
+fi
+
+JOB_NAME="gpt2_${MODEL_SIZE}M_bsz${TOTAL_BATCHSIZE}_seq${SEQ_LEN}_lr${LR}_warmup${LR_WARMUP_ITER}_decay${LR_DECAY_TOKEN}_seed${SEED}_${TAG}_stage${stage}_n${NUM_WORKERS}_g${NUM_GPUS_PER_WORKER}_mp${MP_SIZE}"
+LOG_NAME="${JOB_NAME}_${host}_${current_time}"
+
+OUTPUT_BASEPATH="/vc_data_blob/users/conglli"
+mkdir -p "${OUTPUT_BASEPATH}/tensorboard/curriculum/"
+mkdir -p "${OUTPUT_BASEPATH}/checkpoint/curriculum/"
+mkdir -p "${OUTPUT_BASEPATH}/log/curriculum/"
+LOGDIR="${OUTPUT_BASEPATH}/tensorboard/curriculum/${LOG_NAME}"
+CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/curriculum/${JOB_NAME}"
+
+gpt_options=" \
+        --tensor-model-parallel-size ${MP_SIZE} \
+        --num-layers $NUM_LAYERS \
+        --hidden-size $HIDDEN_SIZE \
+        --num-attention-heads $NUM_ATTN_HEADS \
+        --seq-length $SEQ_LEN \
+        --max-position-embeddings $SEQ_LEN \
+        --micro-batch-size $BATCHSIZE \
+        --global-batch-size ${TOTAL_BATCHSIZE} \
+        --train-iters $NUM_ITER \
+        --train-tokens $NUM_TOKEN \
+        --lr-decay-tokens $LR_DECAY_TOKEN \
+        --save $CHECKPOINT_PATH \
+        --load $CHECKPOINT_PATH \
+        --data-path $DATA_PATH \
+        --vocab-file $VOCAB_PATH \
+        --merge-file $MERGE_PATH \
+        --data-impl mmap \
+        --split 949,50,1 \
+        --distributed-backend nccl \
+        --override-lr-scheduler \
+        --lr $LR \
+        --lr-decay-style cosine \
+        --min-lr 1.0e-5 \
+        --weight-decay 1e-2 \
+        --clip-grad 1.0 \
+        --lr-warmup-iters $LR_WARMUP_ITER \
+        --checkpoint-activations \
+        --log-interval 100 \
+        --save-interval $SAVE_INTERVAL \
+        --eval-interval 100 \
+        --eval-iters 10 \
+        --fp16 \
+        --seed $SEED \
+        --tensorboard-queue-size 1 \
+        --log-timers-to-tensorboard \
+        --log-batch-size-to-tensorboard \
+        --log-validation-ppl-to-tensorboard \
+        --no-masked-softmax-fusion \
+        --tensorboard-dir ${LOGDIR}
+"
+
+deepspeed_options=" \
+        --deepspeed \
+        --deepspeed_config ${config_json} \
+        --zero-stage ${stage} \
+        --pipeline-model-parallel-size ${PP_SIZE} \
+        --deepspeed-activation-checkpointing
+"
+
+full_options="${gpt_options} ${deepspeed_options}"
+
+run_cmd="deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER}  ../../pretrain_gpt.py ${full_options} &>> ${OUTPUT_BASEPATH}/log/curriculum/${JOB_NAME}.log"
+echo "${run_cmd}"
+eval "${run_cmd}"
+
+set +x
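When `CONFIG_TEMPLATE` is `true`, the launcher above fills three placeholders (`CONFIG_CL_MIN`, `CONFIG_CL_MAX`, `CONFIG_CL_DURATION`) in a template JSON with `sed`. The sketch below reproduces just that substitution against a throwaway template so it can be tried without the repo. The function name `render_cl_config` and the three-line template are assumptions made for illustration; only the placeholder names come from the diff.

```shell
#!/bin/bash
# Hypothetical helper: substitute the three curriculum placeholders in a template.
render_cl_config() {
    local template=$1 cl_min=$2 cl_max=$3 cl_steps=$4
    sed -e "s/CONFIG_CL_MIN/${cl_min}/" \
        -e "s/CONFIG_CL_MAX/${cl_max}/" \
        -e "s/CONFIG_CL_DURATION/${cl_steps}/" \
        "${template}"
}

# Throwaway template holding only the placeholder fields (not the real config).
template=$(mktemp)
cat > "${template}" <<'EOF'
{
  "min_difficulty": CONFIG_CL_MIN,
  "max_difficulty": CONFIG_CL_MAX,
  "total_curriculum_step": CONFIG_CL_DURATION
}
EOF

# Render with the values used in ds_train.sh and print the result.
render_cl_config "${template}" 64 1024 45000
rm -f "${template}"
```

Because the placeholders are bare tokens rather than quoted strings, the template is not valid JSON until it has been rendered, so only the generated file should be passed to `--deepspeed_config`.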
diff --git a/examples/curriculum_learning/ds_train.sh b/examples/curriculum_learning/ds_train.sh
@@ -0,0 +1,37 @@
+# # baseline
+# CONFIG=baseline
+# TAG=baseline
+# MODEL_SIZE=1558
+# LR=1.5e-4
+# BSZ=512
+# SEQ_LEN=1024
+# MP_SIZE=1
+# SEED=1234
+# SAVE_INTERVAL=5000
+# NUM_ITER=600000
+# NUM_TOKEN=157286400000
+# LR_DECAY_TOKEN=157286400000
+# LR_WARMUP_ITER=3000
+# CONFIG_TEMPLATE=false
+# CURRICULUM_STEP=0
+# CURRICULUM_MIN=0
+
+# curriculum learning
+CONFIG=curriculum_fixed_linear
+MODEL_SIZE=1558
+LR=6e-4
+BSZ=4096
+SEQ_LEN=1024
+MP_SIZE=1
+SEED=1234
+SAVE_INTERVAL=1000
+NUM_ITER=75000
+NUM_TOKEN=157286400000
+LR_DECAY_TOKEN=157286400000
+LR_WARMUP_ITER=3000
+CONFIG_TEMPLATE=true
+CURRICULUM_STEP=45000
+CURRICULUM_MIN=64
+TAG="${CONFIG}_s${CURRICULUM_MIN}to${SEQ_LEN}_step${CURRICULUM_STEP}"
+
+bash ds_pretrain_gpt2.sh $CONFIG $TAG $MODEL_SIZE $LR $BSZ $SEQ_LEN $MP_SIZE $SEED $SAVE_INTERVAL $NUM_ITER $NUM_TOKEN $LR_DECAY_TOKEN $LR_WARMUP_ITER $CONFIG_TEMPLATE $CURRICULUM_STEP $CURRICULUM_MIN
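The values in `ds_train.sh` can be sanity-checked with a little arithmetic: at full sequence length, one step consumes `BSZ * SEQ_LEN` tokens, so `NUM_TOKEN / (BSZ * SEQ_LEN)` is the number of full-length steps the token budget corresponds to. The sketch below uses a hypothetical function name; the inputs are the values from the script.

```shell
#!/bin/bash
# Hypothetical helper: steps needed to consume a token budget at full seqlen.
full_length_steps() {
    local num_token=$1 batch_size=$2 seq_len=$3
    echo $(( num_token / (batch_size * seq_len) ))
}

full_length_steps 157286400000 4096 1024   # curriculum run        -> 37500
full_length_steps 157286400000 512 1024    # commented-out baseline -> 300000
```

In both cases `NUM_ITER` (75000 and 600000) is twice this figure, which matches the advice in the removed README to set `--train-iters` generously and let `--train-tokens` end the run. A curriculum run needs more than the full-length step count, since early steps use shorter sequences.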
diff --git a/examples/curriculum_learning/ds_zero_stage_1_config_baseline.json b/examples/curriculum_learning/ds_zero_stage_1_config_baseline.json
@@ -0,0 +1,26 @@
+{
+  "train_batch_size": 512,
+  "gradient_accumulation_steps": 1,
+  "steps_per_print": 1,
+  "zero_optimization": {
+    "stage": 1
+  },
+  "optimizer": {
+    "type": "Adam",
+    "params": {
+      "lr": 0.00015,
+      "max_grad_norm": 1.0,
+      "betas": [0.9, 0.95]
+    }
+  },
+  "gradient_clipping": 1.0,
+  "fp16": {
+    "enabled": true,
+    "loss_scale": 0,
+    "loss_scale_window": 1000,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "wall_clock_breakdown": false,
+  "zero_allow_untested_optimizer": false
+}
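DeepSpeed expects `train_batch_size` to equal the micro batch size per GPU times `gradient_accumulation_steps` times the data-parallel size. The launcher derives the micro batch size the same way, from `TOTAL_BATCHSIZE` and `DP_SIZE`. The sketch below redoes that derivation as a function; the name `micro_batch_size` is hypothetical, and the 128 GPUs are the script's `NUM_WORKERS=16` times `NUM_GPUS_PER_WORKER=8`.

```shell
#!/bin/bash
# Hypothetical helper mirroring the launcher's DP_SIZE / BATCHSIZE arithmetic.
micro_batch_size() {
    local total_batch=$1 num_gpus=$2 mp_size=$3 pp_size=$4 grad_accum=$5
    local dp_size=$(( num_gpus / (pp_size * mp_size) ))
    echo $(( total_batch / (dp_size * grad_accum) ))
}

micro_batch_size 512 128 1 1 1    # baseline config above   -> 4
micro_batch_size 4096 128 1 1 1   # BSZ=4096 in ds_train.sh -> 32
```

Integer division rounds down silently, so a total batch size that is not a multiple of the data-parallel size would produce a micro batch size that no longer multiplies back to `train_batch_size`.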
diff --git a/...les/curriculum_learning/ds_config_cl.json → ...age_1_config_curriculum_fixed_linear.json b/...les/curriculum_learning/ds_config_cl.json → ...age_1_config_curriculum_fixed_linear.json
@@ -3,7 +3,7 @@
   "gradient_accumulation_steps": 1,
   "steps_per_print": 1,
   "zero_optimization": {
-    "stage": 0
+    "stage": 1
   },
   "optimizer": {
     "type": "Adam",
@@ -26,11 +26,11 @@
   "curriculum_learning": {
     "enabled": true,
     "curriculum_type": "seqlen",
-    "min_difficulty": 8,
-    "max_difficulty": 1024,
+    "min_difficulty": CONFIG_CL_MIN,
+    "max_difficulty": CONFIG_CL_MAX,
     "schedule_type": "fixed_linear",
     "schedule_config": {
-      "total_curriculum_step": 60000,
+      "total_curriculum_step": CONFIG_CL_DURATION,
       "difficulty_step": 8
     }
   }
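To get a feel for what the `fixed_linear` fields control, the sketch below computes a sequence length per step under a stated assumption: the difficulty is interpolated linearly from `min_difficulty` to `max_difficulty` over `total_curriculum_step` steps, rounded down to a multiple of `difficulty_step`, and capped at `max_difficulty`. This is an illustration of the config semantics as described in the README, not DeepSpeed's actual implementation, and the function name is hypothetical. The inputs are the values from `ds_train.sh` (64, 1024, 45000) and the template's `difficulty_step` of 8.

```shell
#!/bin/bash
# Hypothetical sketch of a fixed-linear seqlen schedule (assumed semantics).
fixed_linear_seqlen() {
    local step=$1 min_d=$2 max_d=$3 total=$4 diff_step=$5
    # Linear interpolation between min and max difficulty.
    local d=$(( min_d + step * (max_d - min_d) / total ))
    # Round down to a multiple of difficulty_step.
    d=$(( d - d % diff_step ))
    # Never exceed the full sequence length.
    if (( d > max_d )); then d=$max_d; fi
    echo "$d"
}

for step in 0 100 22500 45000 60000; do
    echo "step ${step}: seqlen $(fixed_linear_seqlen "${step}" 64 1024 45000 8)"
done
# step 0: seqlen 64
# step 100: seqlen 64
# step 22500: seqlen 544
# step 45000: seqlen 1024
# step 60000: seqlen 1024
```

With `difficulty_step` at 8, every seqlen the schedule emits stays a multiple of 8, which is the Tensor Core constraint the README mentions.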

diff --git a/examples/curriculum_learning/pretrain_gpt_cl.sh b/examples/curriculum_learning/pretrain_gpt_cl.sh