forked from microsoft/DeepSpeed
Showing 7 changed files with 223 additions and 142 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,32 +1 @@ | ||
This is a short tutorial on how to use and tune the curriculum learning (CL) integration. Currently it is integrated only for GPT pre-training. For technical details, please refer to our [paper](https://arxiv.org/abs/2108.06084). | ||
|
||
# Disable batch size warmup (--rampup-batch-size) | ||
In section 5.4 of our [paper](https://arxiv.org/abs/2108.06084) we demonstrate that seqlen-based curriculum learning provides much better training stability than the batch size warmup technique. When using CL, remove the `--rampup-batch-size` config from your training script. Using both CL and batch size warmup is not recommended, because both of them reduce the number of tokens in a batch. You may also want to increase your micro batch size, since without batch size warmup your batch size is now fixed. | ||
|
||
# Token-based training termination | ||
|
||
Because CL changes the length of each sequence/sample during training, it is very hard, if not impossible, to use the number of steps or samples to terminate training at exactly the desired number of tokens. We therefore add a `--train-tokens` config as an accurate token-based alternative. We recommend increasing your original `--train-samples` or `--train-iters` to a large enough number (e.g., 2X what you used for the baseline) and setting `--train-tokens` to the exact desired number of training tokens (e.g., 300B for GPT-3-like training). | ||
|
||
# Token-based LR decay | ||
|
||
Again, because CL changes the number of tokens per batch, Appendix A.2 of our [paper](https://arxiv.org/abs/2108.06084) shows that the LR decay must also become token-based (to avoid decaying the LR too fast). We therefore add a `--lr-decay-tokens` config, which sets the number of LR decay tokens. If you previously used `--lr-decay-samples`, compute `--lr-decay-tokens` by multiplying the former by the full seqlen (e.g., 2K for GPT-3), then replace `--lr-decay-samples` with `--lr-decay-tokens` in your script. | ||
|
||
# LR warmup adjustment | ||
|
||
We do not make the LR warmup token-based, because for CL that would slow down the LR warmup, which is both unnecessary and harmful. However, you may need to adjust your `--lr-warmup-samples` or `--lr-warmup-iters` relative to the non-CL case for various reasons (e.g., if you used `--rampup-batch-size` in the non-CL case, CL does not use it, so the number of samples per batch will differ at the beginning). Assuming you want to use `X` tokens to warm up the LR (for OpenAI GPT-3 this was 375M tokens), for the CL case you should set `--lr-warmup-samples` to `X` divided by the `min_difficulty` below, or set `--lr-warmup-iters` to `X` divided by `min_difficulty * --global-batch-size`. This is a rough estimate based on the assumption that CL starts from seqlen `min_difficulty` and that the seqlen will not increase much during LR warmup. | ||
|
||
# Token-based tensorboard | ||
|
||
Because of the above changes, we also add token-based tensorboard scalars. We also add scalars that plot the seqlen at each step. | ||
|
||
# Curriculum learning hyperparameter tuning strategy | ||
|
||
The curriculum learning hyperparameters are all located in the DeepSpeed config JSON file (see the example `ds_config_cl.json` in this dir). There are a few config entries that you may need to adjust to your circumstances, two of which require some tuning. Appendix A.1 of our [paper](https://arxiv.org/abs/2108.06084) describes the tuning strategy in more detail. | ||
|
||
1. `max_difficulty` should be set as the full seqlen (i.e., your `--seq-length`). No need to tune this. | ||
|
||
2. `min_difficulty` is the starting seqlen used by CL. In general, a smaller `min_difficulty` can provide a greater stability/convergence speed benefit. However, we observe that for a larger model or for different training data, starting from a very small seqlen can lead to significant validation PPL fluctuation (or even divergence) at the very beginning. We recommend starting with `min_difficulty` at 64 and increasing it if you observe problems at the very beginning. Note that to enable Tensor Core acceleration you should always use a multiple of 8. | ||
|
||
3. `total_curriculum_step` is the total number of steps used by CL. In general, a larger `total_curriculum_step` can provide a greater stability/convergence speed benefit. However, we observe that too large a `total_curriculum_step` can lead to overfitting and significant validation PPL fluctuation (or even divergence) during the first few multiples of the LR warmup steps. Our paper describes a detailed tuning strategy based on binary search. If you want to reduce the tuning effort, we recommend setting `total_curriculum_step` directly to half of the baseline's total number of steps. This may not provide the highest convergence speed benefit, but it should provide sufficient training stability gains. | ||
|
||
4. `difficulty_step` is the change in seqlen per CL step. A smaller value is preferable since it gives a smoother curriculum and better stability. Like `min_difficulty`, it needs to be a multiple of 8 for Tensor Core acceleration, so 8 is a good default. | ||
This is an example of how to use DeepSpeed's curriculum learning (CL) feature which provides faster and more stable language model pre-training. Currently it is only integrated for GPT pre-training. Note that there are two curriculum learning examples in two different repos for Megatron-LM GPT-2 pre-training. Both of them have some unique features and limitations. See details in our [tutorial](https://www.deepspeed.ai/tutorials/curriculum-learning/). For technical details please refer to our [paper](https://arxiv.org/abs/2108.06084). |
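The token-based settings described in the tutorial text above reduce to a few lines of arithmetic. The following is a minimal sketch rather than part of the commit: the function names are hypothetical, and the global batch size of 4096 and the 157,286,400,000-token budget are borrowed from the curriculum run script in this commit purely for illustration. Only the formulas themselves come from the tutorial.

```python
# Sketch of the token-based settings from the tutorial text.
# All function names are hypothetical helpers, not DeepSpeed or Megatron-LM APIs.


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, so estimates are never rounded down."""
    return -(-a // b)


def lr_decay_tokens(lr_decay_samples: int, full_seqlen: int) -> int:
    """--lr-decay-tokens = former --lr-decay-samples * full sequence length."""
    return lr_decay_samples * full_seqlen


def lr_warmup_samples(warmup_tokens: int, min_difficulty: int) -> int:
    """--lr-warmup-samples = X / min_difficulty (rough estimate)."""
    return ceil_div(warmup_tokens, min_difficulty)


def lr_warmup_iters(warmup_tokens: int, min_difficulty: int,
                    global_batch_size: int) -> int:
    """--lr-warmup-iters = X / (min_difficulty * global batch size)."""
    return ceil_div(warmup_tokens, min_difficulty * global_batch_size)


def train_iters_cap(train_tokens: int, global_batch_size: int,
                    full_seqlen: int, safety_factor: int = 2) -> int:
    """Generous --train-iters: a multiple of the full-seqlen iteration count,
    so that --train-tokens is what actually ends training."""
    return safety_factor * ceil_div(train_tokens, global_batch_size * full_seqlen)


if __name__ == "__main__":
    warmup_tokens = 375_000_000  # X, the GPT-3 warmup budget quoted in the tutorial
    print(lr_warmup_samples(warmup_tokens, 64))                # 5859375
    print(lr_warmup_iters(warmup_tokens, 64, 4096))            # 1431
    print(train_iters_cap(157_286_400_000, 4096, 1024))        # 75000
```

With the run script's values, the last call yields 75000, which matches the `NUM_ITER` used for the curriculum run: twice the 37500 iterations a full-seqlen run would need to reach the same token budget.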
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,150 @@ | ||
#! /bin/bash | ||
|
||
CONFIG=$1 | ||
TAG=$2 | ||
MODEL_SIZE=$3 | ||
LR=$4 | ||
TOTAL_BATCHSIZE=$5 | ||
SEQ_LEN=$6 | ||
MP_SIZE=$7 | ||
SEED=$8 | ||
SAVE_INTERVAL=$9 | ||
NUM_ITER=${10} | ||
NUM_TOKEN=${11} | ||
LR_DECAY_TOKEN=${12} | ||
LR_WARMUP_ITER=${13} | ||
CONFIG_TEMPLATE=${14} | ||
CURRICULUM_STEP=${15} | ||
CURRICULUM_MIN=${16} | ||
|
||
# 12-layer, 768-hidden, 12-heads, 117M parameters | ||
# 24-layer, 1024-hidden, 16-heads, 345M parameters | ||
# 36-layer, 1280-hidden, 20-heads, 774M parameters | ||
# 48-layer, 1600-hidden, 25-heads, 1558M parameters | ||
if [[ $MODEL_SIZE -eq 117 ]]; then | ||
NUM_LAYERS=12 | ||
HIDDEN_SIZE=768 | ||
NUM_ATTN_HEADS=12 | ||
elif [[ $MODEL_SIZE -eq 345 ]]; then | ||
NUM_LAYERS=24 | ||
HIDDEN_SIZE=1024 | ||
NUM_ATTN_HEADS=16 | ||
elif [[ $MODEL_SIZE -eq 774 ]]; then | ||
NUM_LAYERS=36 | ||
HIDDEN_SIZE=1280 | ||
NUM_ATTN_HEADS=20 | ||
elif [[ $MODEL_SIZE -eq 1558 ]]; then | ||
NUM_LAYERS=48 | ||
HIDDEN_SIZE=1600 | ||
NUM_ATTN_HEADS=25 | ||
else | ||
echo "Model size not supported." | ||
exit 1 | ||
fi | ||
|
||
# Pipeline parallelism. 1 means no pipelines. | ||
PP_SIZE=1 | ||
|
||
# Change for multinode config | ||
NUM_WORKERS=16 | ||
NUM_GPUS_PER_WORKER=8 | ||
NUM_GPUS=$(( ${NUM_WORKERS} * ${NUM_GPUS_PER_WORKER} )) | ||
if [[ $PP_SIZE -gt 0 ]]; then | ||
DP_SIZE=$(( ${NUM_GPUS} / (${PP_SIZE} * ${MP_SIZE}) )) | ||
else | ||
DP_SIZE=$(( ${NUM_GPUS} / ${MP_SIZE} )) | ||
fi | ||
# Batch size per gpu, here we assume grad accumulation step 1 | ||
# you can reduce this if gpu OOM | ||
BATCHSIZE=$((TOTAL_BATCHSIZE/DP_SIZE)) | ||
|
||
DATA_PATH=/vc_data/Megatron-LM/data/indexed_datasets/megatron | ||
VOCAB_PATH=/vc_data/Megatron-LM/data/gpt2-vocab.json | ||
MERGE_PATH=/vc_data/Megatron-LM/data/gpt2-merges.txt | ||
|
||
#ZeRO Configs | ||
stage=1 | ||
|
||
current_time=$(date "+%Y.%m.%d-%H.%M.%S") | ||
script_path=$(realpath "$0") | ||
script_dir=$(dirname "$script_path") | ||
host="${HOSTNAME}" | ||
|
||
if [ "${CONFIG_TEMPLATE}" = "true" ]; then | ||
template_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}.json" | ||
config_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}_min${CURRICULUM_MIN}_max${SEQ_LEN}_step${CURRICULUM_STEP}.json" | ||
sed "s/CONFIG_CL_MIN/${CURRICULUM_MIN}/" ${template_json} \ | ||
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \ | ||
| sed "s/CONFIG_CL_DURATION/${CURRICULUM_STEP}/" \ | ||
> ${config_json} | ||
else | ||
config_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}.json" | ||
fi | ||
|
||
JOB_NAME="gpt2_${MODEL_SIZE}M_bsz${TOTAL_BATCHSIZE}_seq${SEQ_LEN}_lr${LR}_warmup${LR_WARMUP_ITER}_decay${LR_DECAY_TOKEN}_seed${SEED}_${TAG}_stage${stage}_n${NUM_WORKERS}_g${NUM_GPUS_PER_WORKER}_mp${MP_SIZE}" | ||
LOG_NAME="${JOB_NAME}_${host}_${current_time}" | ||
|
||
OUTPUT_BASEPATH="/vc_data_blob/users/conglli" | ||
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/curriculum/" | ||
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/curriculum/" | ||
mkdir -p "${OUTPUT_BASEPATH}/log/curriculum/" | ||
LOGDIR="${OUTPUT_BASEPATH}/tensorboard/curriculum/${LOG_NAME}" | ||
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/curriculum/${JOB_NAME}" | ||
|
||
gpt_options=" \ | ||
--tensor-model-parallel-size ${MP_SIZE} \ | ||
--num-layers $NUM_LAYERS \ | ||
--hidden-size $HIDDEN_SIZE \ | ||
--num-attention-heads $NUM_ATTN_HEADS \ | ||
--seq-length $SEQ_LEN \ | ||
--max-position-embeddings $SEQ_LEN \ | ||
--micro-batch-size $BATCHSIZE \ | ||
--global-batch-size ${TOTAL_BATCHSIZE} \ | ||
--train-iters $NUM_ITER \ | ||
--train-tokens $NUM_TOKEN \ | ||
--lr-decay-tokens $LR_DECAY_TOKEN \ | ||
--save $CHECKPOINT_PATH \ | ||
--load $CHECKPOINT_PATH \ | ||
--data-path $DATA_PATH \ | ||
--vocab-file $VOCAB_PATH \ | ||
--merge-file $MERGE_PATH \ | ||
--data-impl mmap \ | ||
--split 949,50,1 \ | ||
--distributed-backend nccl \ | ||
--override-lr-scheduler \ | ||
--lr $LR \ | ||
--lr-decay-style cosine \ | ||
--min-lr 1.0e-5 \ | ||
--weight-decay 1e-2 \ | ||
--clip-grad 1.0 \ | ||
--lr-warmup-iters $LR_WARMUP_ITER \ | ||
--checkpoint-activations \ | ||
--log-interval 100 \ | ||
--save-interval $SAVE_INTERVAL \ | ||
--eval-interval 100 \ | ||
--eval-iters 10 \ | ||
--fp16 \ | ||
--seed $SEED \ | ||
--tensorboard-queue-size 1 \ | ||
--log-timers-to-tensorboard \ | ||
--log-batch-size-to-tensorboard \ | ||
--log-validation-ppl-to-tensorboard \ | ||
--no-masked-softmax-fusion \ | ||
--tensorboard-dir ${LOGDIR} | ||
" | ||
|
||
deepspeed_options=" \ | ||
--deepspeed \ | ||
--deepspeed_config ${config_json} \ | ||
--zero-stage ${stage} \ | ||
--pipeline-model-parallel-size ${PP_SIZE} \ | ||
--deepspeed-activation-checkpointing | ||
" | ||
|
||
full_options="${gpt_options} ${deepspeed_options}" | ||
|
||
run_cmd="deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} ../../pretrain_gpt.py ${full_options} &>> ${OUTPUT_BASEPATH}/log/curriculum/${JOB_NAME}.log" | ||
echo ${run_cmd} | ||
eval ${run_cmd} | ||
|
||
set +x |
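The script derives the data-parallel size and the per-GPU micro batch size from the cluster shape with plain integer division. The sketch below mirrors that arithmetic in Python so the values can be checked before launching a job. The function name is hypothetical, and the divisibility check is an addition that the bash script does not perform; it is included here on the assumption that a silently truncated micro batch size is undesirable.

```python
# Sketch mirroring the DP_SIZE / BATCHSIZE arithmetic of the launch script.
# plan_parallelism is a hypothetical helper, not part of DeepSpeed.


def plan_parallelism(total_batch_size: int, mp_size: int, pp_size: int = 1,
                     num_workers: int = 16, gpus_per_worker: int = 8):
    """Return (dp_size, micro_batch_size), assuming gradient accumulation of 1."""
    num_gpus = num_workers * gpus_per_worker
    if pp_size > 0:
        dp_size = num_gpus // (pp_size * mp_size)
    else:
        dp_size = num_gpus // mp_size
    # Extra guard (not in the script): refuse configurations where integer
    # division would silently change the effective global batch size.
    if dp_size == 0 or total_batch_size % dp_size != 0:
        raise ValueError(
            f"global batch {total_batch_size} is not divisible by dp_size {dp_size}")
    return dp_size, total_batch_size // dp_size


if __name__ == "__main__":
    print(plan_parallelism(4096, mp_size=1))  # (128, 32)
    print(plan_parallelism(512, mp_size=1))   # (128, 4)
```

On the script's default 16 nodes with 8 GPUs each, the curriculum run (global batch 4096) gets a micro batch of 32 per GPU, while the baseline (global batch 512) gets 4.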
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# # baseline | ||
# CONFIG=baseline | ||
# TAG=baseline | ||
# MODEL_SIZE=1558 | ||
# LR=1.5e-4 | ||
# BSZ=512 | ||
# SEQ_LEN=1024 | ||
# MP_SIZE=1 | ||
# SEED=1234 | ||
# SAVE_INTERVAL=5000 | ||
# NUM_ITER=600000 | ||
# NUM_TOKEN=157286400000 | ||
# LR_DECAY_TOKEN=157286400000 | ||
# LR_WARMUP_ITER=3000 | ||
# CONFIG_TEMPLATE=false | ||
# CURRICULUM_STEP=0 | ||
# CURRICULUM_MIN=0 | ||
|
||
# curriculum learning | ||
CONFIG=curriculum_fixed_linear | ||
MODEL_SIZE=1558 | ||
LR=6e-4 | ||
BSZ=4096 | ||
SEQ_LEN=1024 | ||
MP_SIZE=1 | ||
SEED=1234 | ||
SAVE_INTERVAL=1000 | ||
NUM_ITER=75000 | ||
NUM_TOKEN=157286400000 | ||
LR_DECAY_TOKEN=157286400000 | ||
LR_WARMUP_ITER=3000 | ||
CONFIG_TEMPLATE=true | ||
CURRICULUM_STEP=45000 | ||
CURRICULUM_MIN=64 | ||
TAG="${CONFIG}_s${CURRICULUM_MIN}to${SEQ_LEN}_step${CURRICULUM_STEP}" | ||
|
||
bash ds_pretrain_gpt2.sh $CONFIG $TAG $MODEL_SIZE $LR $BSZ $SEQ_LEN $MP_SIZE $SEED $SAVE_INTERVAL $NUM_ITER $NUM_TOKEN $LR_DECAY_TOKEN $LR_WARMUP_ITER $CONFIG_TEMPLATE $CURRICULUM_STEP $CURRICULUM_MIN |
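To give a feel for what the curriculum run above requests, the following sketch computes the sequence length at a given step for a linear ramp from `CURRICULUM_MIN` to `SEQ_LEN` over `CURRICULUM_STEP` steps. It rests on assumptions: the linear shape is inferred from the config name `curriculum_fixed_linear`, the `difficulty_step` of 8 is the tutorial's suggested default rather than a value set in this script, and the rounding rule (round down to a multiple of `difficulty_step`) is a guess. DeepSpeed's actual scheduler may differ in detail.

```python
# Sketch of a linear seqlen curriculum. seqlen_at_step is a hypothetical
# helper; it is not DeepSpeed's scheduler implementation.


def seqlen_at_step(step: int, min_difficulty: int = 64, max_difficulty: int = 1024,
                   total_curriculum_step: int = 45000, difficulty_step: int = 8) -> int:
    """Sequence length used at a training step under an assumed linear ramp."""
    step = max(step, 0)
    if step >= total_curriculum_step:
        return max_difficulty  # curriculum finished: train at full seqlen
    # Linear interpolation in integer arithmetic.
    progress = (max_difficulty - min_difficulty) * step // total_curriculum_step
    # Round the increase down to a multiple of difficulty_step (assumed rule),
    # which keeps the seqlen a multiple of 8 when min_difficulty is one.
    increase = (progress // difficulty_step) * difficulty_step
    return min(min_difficulty + increase, max_difficulty)


if __name__ == "__main__":
    for s in (0, 375, 22500, 45000, 75000):
        print(s, seqlen_at_step(s))
```

Under these assumptions the run starts at seqlen 64, reaches 544 halfway through the curriculum, and trains at the full 1024 from step 45000 onward.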
26 changes: 26 additions & 0 deletions
examples/curriculum_learning/ds_zero_stage_1_config_baseline.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
{ | ||
"train_batch_size": 512, | ||
"gradient_accumulation_steps": 1, | ||
"steps_per_print": 1, | ||
"zero_optimization": { | ||
"stage": 1 | ||
}, | ||
"optimizer": { | ||
"type": "Adam", | ||
"params": { | ||
"lr": 0.00015, | ||
"max_grad_norm": 1.0, | ||
"betas": [0.9, 0.95] | ||
} | ||
}, | ||
"gradient_clipping": 1.0, | ||
"fp16": { | ||
"enabled": true, | ||
"loss_scale": 0, | ||
"loss_scale_window": 1000, | ||
"hysteresis": 2, | ||
"min_loss_scale": 1 | ||
}, | ||
"wall_clock_breakdown": false, | ||
"zero_allow_untested_optimizer": false | ||
} |
This file was deleted.