training the 20 and 8 billion model failed on SUMMIT #115

Closed

agemagician opened this issue Feb 29, 2020 · 1 comment

agemagician commented Feb 29, 2020

Hello,

I am trying to train the 8 billion and the 20 billion parameter models on SUMMIT, and both fail.
SUMMIT has 6 NVIDIA V100 16GB GPUs per node.
Both the 8 billion and the 20 billion models run out of GPU memory (OOM).

The training commands are (the 20 billion model first, then the 8 billion model):

export MP_SIZE=6

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 100 \
       --hidden-size 3720 \
       --num-attention-heads 30 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert_nccl.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 72 \
       --hidden-size 3072 \
       --num-attention-heads 24 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1
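
For reference, a back-of-envelope parameter count for these two configurations (my own sketch, assuming the usual transformer layout of roughly 12 * L * h^2 weights per stack plus embeddings; approx_params below is illustrative, not part of the example code):

def approx_params(num_layers, hidden_size, vocab_size=30, seq_len=512):
    # 4*h^2 (attention) + 8*h^2 (MLP) per layer; biases and layernorms
    # are omitted, so this slightly undercounts.
    per_layer = 12 * hidden_size ** 2
    embeddings = (vocab_size + seq_len) * hidden_size
    return num_layers * per_layer + embeddings

for name, layers, hidden in [("20B config", 100, 3720), ("8B config", 72, 3072)]:
    total = approx_params(layers, hidden)
    print(f"{name}: ~{total / 1e9:.1f}B total, ~{total / 6 / 1e9:.2f}B per rank at MP=6")

That gives roughly 16.6B and 8.2B parameters, i.e. about 2.77B and 1.36B per model-parallel rank, which lines up with the per-rank counts reported in the logs below (2,799,983,247 and 1,381,032,967).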

The config file is:

{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0
    }
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  } 
}
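
As a rough sanity check on where the memory goes (my own arithmetic, assuming the standard mixed-precision layout: 2-byte FP16 weights, a 4-byte FP32 master copy held by the FP16 optimizer, and 8 bytes of FP32 Adam state per parameter):

params_per_rank = 2_799_983_247  # per-rank count from the 20 billion run's log
GiB = 1024 ** 3

fp16_weights = 2 * params_per_rank  # half-precision model weights
fp32_master = 4 * params_per_rank   # FP32 master copy created at optimizer init
adam_states = 8 * params_per_rank   # exp_avg + exp_avg_sq, both FP32

print(f"fp16 weights: {fp16_weights / GiB:.1f} GiB")
print(f"fp32 master:  {fp32_master / GiB:.1f} GiB")
print(f"adam states:  {adam_states / GiB:.1f} GiB")
print(f"total, before activations: {(fp16_weights + fp32_master + adam_states) / GiB:.1f} GiB")

The FP16 weights plus the FP32 master copy alone come to about 15.6 GiB on a 15.75 GiB GPU, which matches the first traceback below: the OOM fires inside master_param = param.detach().clone().float() with 14.50 GiB already allocated, before any Adam state or activations exist.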

I am testing it on 1 node, and even after reducing the train batch size to 1 it still fails.

The logs for the 20 billion model run are:
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 04:40:19.647170: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 04:40:22.566024 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 04:40:22.567073 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 04:40:22.567220 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 04:40:22.567455: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 04:40:22.570236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 04:40:22.572765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 04:40:22.575278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 04:40:22.577850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 04:40:22.580415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 04:40:22.582986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 04:40:22.583008: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 04:40:22.583068: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 04:40:22.583108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 04:40:22.583146: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 04:40:22.585072: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 04:40:22.585118: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 04:40:22.585156: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 04:40:22.615387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 04:40:22.623295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 04:40:22.623314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 04:40:22.646660 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 04:40:25.123421 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 04:40:25.123578 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 04:40:25.123658 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 04:40:25.149839: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 04:40:25.153336 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 04:40:25.153439 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 04:40:25.154995 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 04:40:25.166115 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:125722:125722 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125722:125722 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125722 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:125724:125724 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125726:125726 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125727:125727 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125723:125723 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125971 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125725:125725 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125725:125725 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125725:125725 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125725:125992 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:125723:125993 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:125724:125994 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:125726:125995 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:125727:125996 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:125722:125971 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:125722:125971 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:125726:125995 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:125725:125992 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:125724:125994 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:125722:125971 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:125722:125971 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:125722:125722 [0] NCCL INFO Launch mode Parallel
building BERT model ...
h36n18:125726:125995 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:125723:125993 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
 > number of parameters on model parallel rank 0: 2799983247
h36n18:125722:126579 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125722:126579 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:125722:126579 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 5: 2799983247
 > number of parameters on model parallel rank 3: 2799983247
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 579, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
  File "pretrain_bert_nccl.py", line 170, in setup_model_and_optimizer
    optimizer = get_optimizer(model, args)
  File "pretrain_bert_nccl.py", line 141, in get_optimizer
    'delayed_shift': args.hysteresis})
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 198, in __init__
    master_param = param.detach().clone().float()
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.75 GiB total capacity; 14.50 GiB already allocated; 16.94 MiB free; 373.95 MiB cached; 0 bytes inactive)
 > number of parameters on model parallel rank 2: 2799983247
 > number of parameters on model parallel rank 1: 2799983247
 > number of parameters on model parallel rank 4: 2799983247

The logs for the 8 billion model run are:
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 05:07:35.425203: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 05:07:38.074505 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 05:07:38.074888 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 05:07:38.075031 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 05:07:38.075261: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 05:07:38.078041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 05:07:38.080565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 05:07:38.083095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 05:07:38.085669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 05:07:38.088239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 05:07:38.090805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 05:07:38.090827: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 05:07:38.090887: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 05:07:38.090926: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 05:07:38.090965: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 05:07:38.092861: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 05:07:38.092907: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 05:07:38.092946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 05:07:38.123406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 05:07:38.130912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 05:07:38.130926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 05:07:38.154345 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 05:07:39.526942 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 05:07:39.527102 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 05:07:39.527187 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 05:07:39.553327: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 05:07:39.556849 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 05:07:39.556953 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 05:07:39.559207 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 05:07:39.570396 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:127714:127714 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127714:127714 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127714 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:127718 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127719:127719 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127717:127717 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127963 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127716:127716 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127715:127715 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127984 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127715:127985 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127719:127986 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127717:127987 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127718:127988 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127714:127963 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:127963 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127715:127985 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127717:127987 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127716:127984 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:127714:127963 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:127963 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127715:127985 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:127988 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
building BERT model ...
 > number of parameters on model parallel rank 0: 1381032967
 > number of parameters on model parallel rank 1: 1381032967
 > number of parameters on model parallel rank 5: 1381032967
 > number of parameters on model parallel rank 3: 1381032967
 > number of parameters on model parallel rank 2: 1381032967
h36n18:127714:128267 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127714:128267 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127714:128267 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 4: 1381032967
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127715:128279 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127715:128279 [1] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127715:128279 [1] NCCL INFO comm 0x2001c8006620 rank 0 nranks 1 cudaDev 1 nvmlDev 1 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127719:128281 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128281 [5] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127719:128281 [5] NCCL INFO comm 0x2001ec006620 rank 0 nranks 1 cudaDev 5 nvmlDev 5 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127716:128283 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127716:128283 [2] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127716:128283 [2] NCCL INFO comm 0x200340006620 rank 0 nranks 1 cudaDev 2 nvmlDev 2 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127717:128286 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127717:128286 [3] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127717:128286 [3] NCCL INFO comm 0x200320006620 rank 0 nranks 1 cudaDev 3 nvmlDev 3 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:128288 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127718:128288 [4] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127718:128288 [4] NCCL INFO comm 0x2001f4006620 rank 0 nranks 1 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127718:128337 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128338 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127715:128339 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127717:128341 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127716:128340 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127714:128336 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:128336 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127719:128338 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO comm 0x200408006620 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127715:128339 [1] NCCL INFO comm 0x200424006620 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:128337 [4] NCCL INFO comm 0x200410006620 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127717:128341 [3] NCCL INFO comm 0x20033c006620 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:128336 [0] NCCL INFO comm 0x200718006620 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127716:128340 [2] NCCL INFO comm 0x20035c006620 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
learning rate decaying linear
Partition Activations False and Correctness Check False
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.75 GiB total capacity; 14.04 GiB already allocated; 580.94 MiB free; 200.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 2; 15.75 GiB total capacity; 14.13 GiB already allocated; 586.94 MiB free; 188.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 15.75 GiB total capacity; 14.13 GiB already allocated; 582.88 MiB free; 192.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 5; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 4; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 3; 15.75 GiB total capacity; 14.16 GiB already allocated; 558.94 MiB free; 192.72 MiB cached; 0 bytes inactive)

From my understanding of Table 8 in the paper, you were able to train both the 8 and 20 billion parameter models on 4 x 16GB GPUs using 4-way model parallelism.
In my case I am using 6-way model parallelism with batch size 1, and it doesn't work.

Did I misunderstand something?
Do you have any idea how to make it work?

agemagician changed the title from "training the 20 and 8 billion model failed" to "training the 20 and 8 billion model failed on SUMMIT" on Feb 29, 2020
agemagician (Author) commented:

OK, so I double-checked.
Apparently, only the GPT-2 model includes DeepSpeed support in the examples.
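
That explanation is consistent with the tracebacks above: they run through Megatron-LM's own fp16/fp16.py FP16_Optimizer rather than a DeepSpeed engine, so the "zero_optimization" entry in the config never takes effect for the BERT script. For anyone hitting the same wall, a minimal sketch of what DeepSpeed-wired setup code looks like (assuming the deepspeed package; the function and variable names here are illustrative, not the actual example code):

import deepspeed

def setup_model_and_optimizer(args, model):
    # deepspeed.initialize returns an engine that owns the optimizer; with a
    # ZeRO stage enabled in the config, optimizer state is partitioned across
    # data-parallel ranks instead of being replicated on every GPU.
    model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        args=args,  # args must carry the --deepspeed_config path
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
    )
    return model_engine, optimizer, lr_scheduler

Note that with MP_SIZE=6 on a single node the data-parallel degree is 1, so ZeRO would have had no optimizer state to partition across ranks in this particular run even if it had been active.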

delock added a commit to delock/DeepSpeedSYCLSupport that referenced this issue Nov 8, 2022
* Fix the layer-past for GPT based models (microsoft#2196)

* Add gradient_average flag support for sparse grads (microsoft#2188)

* Add gradient_average flag support for sparse grads

* formatting fixes

* Add tests

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Adding additional instructiosn in the compression tutorial on pre-training distillation and quantization for GPT (microsoft#2197)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Log user config exactly (microsoft#2201)

* Fix the tensor-slicing copy for qkv parameters (microsoft#2198)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor Distributed Tests (microsoft#2180)

Refactor Distributed unit tests

* fix table syntax (microsoft#2204)

Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Correctly detect offload configuration (microsoft#2208)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add cuda 11.7 (microsoft#2211)

* add cuda 11.7

* formatting

* use torch 1.9 (microsoft#2215)

* [zero-3] print warning once and support torch parameter (microsoft#2127)

* print warning only once.

* add support for torch param and only warn on gpu 0

* remove type checking. will be done on a new PR with more tests.

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Add support of OPT models (microsoft#2205)

* add opt replace policy

* simplify inf. api

* fix opt replace policy

* fix use-cash & add relu

* Add support of custom MLP act. function

* Revert "simplify inf. api"

This reverts commit 9e910fc.

* fix the inference API (temp. solution)

* fix code formatting

* add unit tests for OPT models.

* refactor pre-attention layer norm configuration

* add support of opt-350m model

* refactor the HF model config initialization

* fix hf model config issue

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* fix typos in readme. (microsoft#2218)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [device abstraction] add device abstraction to allow other device than CUDA be used

* Fix regression w. dist_init_required (microsoft#2225)

* add doc for new bert example (microsoft#2224)

* Remove the random-generator from context during inference (microsoft#2228)

* Fix the tensor-slicing copy for qkv parameters

* remove the random-generator from context during inference

* formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* allow saving ckpt w/o ckpt json + bloom copy fix (microsoft#2237)

* Correctly detect zero_offload (microsoft#2213)

* Correctly detect offload configuration

* Correctly detect offload configuration

* Handle deprecated cpu offload setting

* Correcly detect zero_offload setting

* Minor tweak

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* update videos (microsoft#2249)

* Refactor dist tests: Checkpointing (microsoft#2202)

Refactor distributed tests: checkpointing

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Make OPT policy backward compatible with pre-OPT transformers versions (microsoft#2254)

* fix ds-inference without policy (microsoft#2247)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.7.2

* Enable contiguous gradients with Z1+MoE (microsoft#2250)

MoE training with zero stage 1 only works with `contiguous gradients=True`.

* [rebase-202208] additional changes needed when rebase to 202208

* [rebase] cleanup direct cuda usage after merge

* Correctly detect CPU optimizer usage (microsoft#2257)

* Correctly detect CPU optimizer usage

* Update nv-transformers-v100.yml (microsoft#2259)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [precommit] fix pre-commit issues

* Update half precision header guards (microsoft#2261)

* fix microsoft#2240: wrong time unit in flops_profiler (microsoft#2241)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.7.3

* Add blob storage to CI runners (microsoft#2260)

Add blob storage to CI runners and enable for transformers cache on inference tests

* Update replace_module.py, test-gptj.py related fix (microsoft#2269)

Fix RuntimeError: Boolean value of Tensor with more than one value is ambiguous when running test-gptj.py

* Fix OrderedDict import for python3.6 (microsoft#2267)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Ds inference/fix mp2 (microsoft#2270)

* Trajepl: nebula load fix (microsoft#2182)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: chenguo <chenguo@microsoft.com>

* prevent torch ext folder mkdir at tmp (microsoft#2274)

* Ds-inference Int8 support through ZeroQuant technology (microsoft#2217)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add a new unit test for cuda ops (microsoft#2278)

Co-authored-by: cmikeh2 <connorholmes@microsoft.com>

* Add to codeowners file (microsoft#2279)

* [pin_memory] make pin_memory select device type

* Memory Access Utility (microsoft#2276)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* Fp32 accuracy bug fix (microsoft#2285)

Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com>

* Refactor universal checkpointing and tensor fragments (microsoft#2253)

* Refactor universal checkpointing and tensor fragments

* Formatting

* [ds-inference] fix progress bar (microsoft#2286)

when loading the non-sharded checkpoint update the progress bar (fix by @RezaYazdaniAminabadi) - I've just tested it to work.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Offload all gradients to nvme (microsoft#2282)

* fused bias relu unittest (microsoft#2297)

* fix for pytest picking up local deepspeed dir instead of installed deepspeed (microsoft#2299)

* Fix for Zero3 when MP>1 and at least one batch param undefined (microsoft#2289)

Co-authored-by: anthony.301 <anthony.301@mri.cluster>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [downstream] merge from xpu support downstream

* Unit test for bias add kernel (microsoft#2298)

* added unit test

* Update pt_binding.cpp

* formatting

* Update test_bias_add.py

* Update relu.cu with mem_access_utils (microsoft#2306)

* Add tensor parallel inference unit tests (microsoft#2232)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>

* Fix the residual add mp scaling for  GPTNeoX (microsoft#2310)

* Add unit tests for residual_add kernels (microsoft#2307)

* add inference eval scripts (microsoft#2303)

* Upgrade P40 tests to torch 1.8 (microsoft#2316)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO-Inference blog (microsoft#2271)

* ZeRO-Inference blog

* ZeRO-Inference blog

* Format fixes

* Apply feedback

* Feedback

* Update docs/_posts/2022-08-27-zero-inference.md

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update docs/_posts/2022-08-27-zero-inference.md

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Address feedback

* Format fixes

* More tweaks

* long sequence, nvme offload

* Add image

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO-Inference blog - wrap up  (microsoft#2321)

* ZeRO-Inference blog - Update README (microsoft#2322)

* refactor to use mem_access (microsoft#2317)

* add quant unit test (microsoft#2315)

* add quant unit test

* add codeowner

* format fix

* fix undefined symbol: curandSetPseudoRandomGeneratorSeed

* modify ref fn name and add comment

* add comments

* add 4bit quant 16groups

* fix

* modify groups in ref code

* parameterize tensor shape

* single param

* detach tensor

* remove -lcurand flag

* add back -lcurand flag

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* only override forward if using cuda-graph (microsoft#2291)

* Add more options to inference benchmark (microsoft#2325)

* bump to 0.7.4

* MOE residual matmult unit test (microsoft#2323)

MOE residual matmul unit tests

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* [device] port cuda device to literal_device() in new tests

* MOE matmult with memaccess (microsoft#2336)

* Fix formatting

* Remove redundant variable

* Refactor residual add kernels (microsoft#2333)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* [accel_runtime] add pin_memory to accelerator runtime interface.

* mem access for quantize kernel (microsoft#2331)

* mem access for quantize kernel

* format

* format fp32

* modify quant kernel

* modify quant kernel2

* modify format

* format

* fix comments in pytest

* fix comments in pytest

* format

* rerun

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>

* increase min pre-commit versions (microsoft#2346)

* Extend scratch buffer for long prompts (microsoft#2212)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix zero docs (microsoft#2350)

* Inference profiling updates/fixes (microsoft#2348) (microsoft#2349)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Kernel Data Conversion Utility (microsoft#2327)

* Unify macro definitions and constants in a single file

* Conversion utility implementation.

* Fix reversion from formatting

* Bugfixes after testing with correct DeepSpeed

* Inline markers are available on both HIP + CUDA

* Add Onebit Optimizers in __init__ (microsoft#2340)

Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [accelerator abstraction] merge from microsoft#2320

* docs(mixture-of-experts-inference): fix typo in tutorial (microsoft#2345)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* download cifar to blob storage (microsoft#2342)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor gptj_residual_add kernels for better readability (microsoft#2358)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* Updated issue templates (microsoft#2363)

* Update issue templates

* fix cuda invalid config error in dequant kernel (microsoft#2362)

* format

* remove round fn

* Add missing pytest fixture scope (microsoft#2353)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Extend residual_add kernel tests to cover pre_attn_norm (microsoft#2354)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Refactor fused_bias_residual kernels for better readability (microsoft#2356)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Capture error message during sweep tests (microsoft#2351)

* Collect error messages in results.csv

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix an exception when recursively casting dicts to fp16 (microsoft#2370)
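
One plausible failure mode is hitting non-float leaves (ints, bools, non-tensor values) while walking a nested dict; a tolerant recursive cast looks roughly like this (a minimal sketch, not the actual fix):

import torch

def cast_to_fp16(value):
    # Recurse through containers; cast only floating-point tensors and
    # pass everything else through unchanged.
    if torch.is_tensor(value):
        return value.half() if value.is_floating_point() else value
    if isinstance(value, dict):
        return {k: cast_to_fp16(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(cast_to_fp16(v) for v in value)
    return value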

* Refactor remaining distributed tests (microsoft#2216)

* batch of refactored tests

* more test refactoring

* fp16 test refactor

* more refactors

* added DistributedFixture class

* applied DistributedFixture to first batch of tests as a trial

* added DistributedFixture test and documentation

* last tests

* fixes for refactored tests

* remove subdirs in workflow files

* fix pytest syntax error

* fix another syntax error

* update imports

* use DistFixture with elastic checkpoint test

* missing import

* update to shared class tmpdir for elastic test

* moved test files

* avoid duplicate test file name

* last refactor and moving test files

* formatting

* fix broken import

* testing forked AMD tests

* update abstract method

* use blob storage for accelerate and transformers tests

* upgrade torch for accelerate CI

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix the MLP output tensor's shape (microsoft#2380)

* allow building with latest CUDA (11.8), it is backwards compatible (microsoft#2390)

* pin transformers version for unit tests (microsoft#2402)

* Change type to tuple in replace_wo_policy isinstance check (microsoft#2387)

Update the isinstance check inside the `replace_wo_policy` function to check for `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple`.

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
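
A minimal sketch of the corrected check (illustrative only; `normalize_policy_layers` is a hypothetical helper, not the actual DeepSpeed code):

def normalize_policy_layers(layers):
    # The injection policy supplies layer names as a tuple (or a single
    # string), so the old isinstance(layers, dict) check never matched.
    if isinstance(layers, str):
        return (layers,)
    if isinstance(layers, (tuple, list)):
        return tuple(layers)
    raise TypeError(f"expected str or tuple of layer names, got {type(layers).__name__}")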

* Checkpoint backwards-compatibility workaround (microsoft#2384)

* Add predicated global load (microsoft#2373)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* change call site of literal_device, on_accel_device and accel_runtime to get_accelerator() call

* add new interface definition from olruwase/accelerator_abstraction

* MII blog post (microsoft#2418)

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* Fix figure reference (microsoft#2419)

* [docs] update news items

* [docs] add mii repo link

* Add SLURM Multinode Runner (microsoft#2404)

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix issue with corrupted output on long generation for GPT (microsoft#2359)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* MII blog title update on Readme

* DeepSpeed-MII title change in website

* Fix GPT Neo-X multi-gpu inference (microsoft#2401)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* MII-Public and MII-Azure subheading in mii post

* CI fixes related to triton (microsoft#2422)

* [docs] update mii blog title (microsoft#2423)

* add SD injection policy (microsoft#2381)

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* [accelerator abstraction] remove name() from interface; device_name() should be used instead.

* merge with master (ec13da6)

* fix checkpoint loading when it is a dictionary (microsoft#2425)

* Make error regex more generic in collect_results.py (microsoft#2415)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fixes microsoft#2389 (microsoft#2411)

truncating expert param storage for checkpointing

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Fix for inference gpt-j test (microsoft#2430)

* fix for gpt-j failing due to tokenizer error

* limit number of gpt-j tokens generated due to low memory

* Fixing bug 2361 (microsoft#2410)

* fixing bug 2361

* adding pytest for config initialization

* changing expected output to FusedAdam

* remove print statement

* running yapf on modified files

* running pre-commit formatting

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Universal checkpoint for zero stage 1 (microsoft#2284)

* Refactor universal checkpointing and tensor fragments

* Formatting

* Support zero stage1; Expand TP dim

* Remove debug prints

* Detect sharded optimizer state

* Format fixes

* Encode reshaping guide

* More symbolic constants

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* only add deps if extra is explicitly called (microsoft#2432)

* Add TestInjectionPolicy inference unittest class for testing custom injection policies (microsoft#2426)

This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies.

This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API.

The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified.

This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see microsoft#2387).

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
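
For concreteness, calling the API with an explicit policy looks roughly like this (a hedged sketch; the layer names below follow DeepSpeed's documented T5 example and may differ from the exact test code):

import torch
import deepspeed
from transformers import T5ForConditionalGeneration
from transformers.models.t5.modeling_t5 import T5Block

model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-small")
# injection_policy maps a transformer block class to the names of its
# output linear layers, so DeepSpeed can inject tensor parallelism
# without relying on a built-in (auto) policy.
model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.float32,
                                 injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')})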

* [memory estimators] new config args sync (microsoft#2431)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* parallelize writing of layer checkpoint files across data parallel instances (microsoft#1419)

* parallelize layer checkpoints across data parallel groups

* use partition_uniform to determine start/end index values

* formatting fix

* config: add option for parallel write of layer checkpoints in pipeline stage

* yapf fixes

* enable parallel layer write according to config param

* avoid extraneous makedir when rank 0 writes all layers

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
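
The parallel-write idea, roughly (a simplified sketch of uniform partitioning, not the actual DeepSpeed implementation; `layer_write_range` is hypothetical):

def layer_write_range(num_layers, dp_rank, dp_world_size):
    # Uniformly partition the layers of a pipeline stage across its
    # data-parallel ranks; each rank then writes only its own slice of
    # layer checkpoint files instead of rank 0 writing everything.
    base, rem = divmod(num_layers, dp_world_size)
    start = dp_rank * base + min(dp_rank, rem)
    return start, start + base + (1 if dp_rank < rem else 0)

# Example: with 48 layers and 4 data-parallel ranks, rank 1 writes
# layers [12, 24).
start, end = layer_write_range(48, 1, 4)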

* Fix broken link to DeepSpeed Megatron fork (microsoft#2440)

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>

* bump to 0.7.5

* [OpBuilder] Add op builder abstraction

* convert op builder usage in merged code

* merge diff files from upstream

* [OpBuilder] add create_op_builder interface in abstract_accelerator.py

* remove files that were deleted upstream

* [OpBuilder] add left over op builder usage in tests

* [OpBuilder] fix op builder usage in tests

* [OpBuilder] fix <op builder>.NAME usage in tests to follow op builder abstraction design

* import get_accelerator from deepspeed.accelerator directly

* [OpBuilder] remove unused function and sync with main

* add missing import

* revert changes in device.py to avoid conflict with main

* fix alexnet_model to use /tmp instead of /blob

* Mingzhi/solve pr108 b (microsoft#115)

* move ALL_OPs from __init__.py to all_Op.py to solve circular import

* delete deepspeedexamples

* fix import

* fix regression (microsoft#117)

* fix pin_memory

* fix regression

* fix error

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Mikhail Druzhinin <dipetm@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Kamal Raj <kamalraj97@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Zhihong Chen <gdst_czh@163.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: 叶志晟 <yzs981130@126.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: trajep <trajepl@gmail.com>
Co-authored-by: chenguo <chenguo@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: anthony.301 <anthony.301@mri.cluster>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Saeyeol Lee <78332687+l4d2boomer@users.noreply.github.com>
Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai>
Co-authored-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org>
Co-authored-by: Matt Smith <matt@mjksmith.com>
Co-authored-by: Thomas-MMJ <112830596+Thomas-MMJ@users.noreply.github.com>
Co-authored-by: lekurile <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Andrey Chernykh <andrew.chernyh@gmail.com>
Co-authored-by: Alexander Jipa <alexander.jipa@gmail.com>
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Adam Moody <moody20@llnl.gov>
Co-authored-by: AGUL <mingzhi.liu@intel.com>