Open
Description
We are running BERT-Large training on a bare-metal Ubuntu server. The log shows no errors, but it also shows no training output, which is confusing.
Command:
python ./launch_benchmark.py \
--model-name=bert_large \
--precision=fp32 \
--mode=training \
--framework=tensorflow \
--batch-size=24 \
--benchmark-only \
--data-location=$BERT_LARGE_DIR \
--num-inter-threads=1 \
-- train-option=SQuAD \
DEBIAN_FRONTEND=noninteractive \
config_file=$BERT_LARGE_DIR/bert_config.json \
init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
vocab_file=$BERT_LARGE_DIR/vocab.txt \
train_file=$SQUAD_DIR/train-v1.1.json \
predict_file=$SQUAD_DIR/dev-v1.1.json \
do-train=True \
learning-rate=1.5e-5 \
max-seq-length=384 \
do_predict=True \
warmup-steps=0 \
num_train_epochs=0.1 \
doc_stride=128 \
do_lower_case=False \
experimental-gelu=False \
mpi_workers_sync_gradients=True
The log:
INFO:tensorflow:Graph was finalized.
I0625 09:40:30.595448 140247941625664 monitored_session.py:246] Graph was finalized.
2021-06-25 09:40:30.595915: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-25 09:40:30.764862: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2892875000 Hz
2021-06-25 09:40:30.767997: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c703127e80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-25 09:40:30.768068: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Running local_init_op.
I0625 09:40:50.980941 140247941625664 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0625 09:40:51.142987 140247941625664 session_manager.py:508] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I0625 09:41:02.433922 140247941625664 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
I0625 09:41:02.434337 140247941625664 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I0625 09:41:08.454857 140247941625664 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
INFO:Running SQuAD...!
----------------------------Run command-------------------------------------
The log stops after the "Run command" separator, so it contains no training results (no loss, step, or throughput lines).
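To confirm that progress lines are really absent rather than buried in the output, the saved log can be scanned with a short script. This is only a minimal sketch: it assumes the run would normally print TensorFlow Estimator-style lines such as `loss = ..., step = ...` or `global_step/sec: ...`, and the function name and patterns are illustrative, not part of `launch_benchmark.py`.

```python
import re

# Assumed progress-line formats (typical of TensorFlow Estimator logging);
# they are not taken from this run, which printed no such lines.
PROGRESS_PATTERNS = [
    re.compile(r"loss = [-+0-9.eE]+, step = \d+"),
    re.compile(r"global_step/sec: [0-9.]+"),
]


def find_training_progress(log_text):
    """Return the log lines that look like training progress."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in PROGRESS_PATTERNS)
    ]
```

Calling `find_training_progress(open(path).read())` on the captured log, where `path` is wherever the output was saved, returns an empty list if no such lines were written.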
@dmsuehir @ashahba could you please help troubleshoot this?
Thanks