Performance: training time of the Transformer model is too long in mixed-precision mode #203
Comments
(1) Are you using a Volta GPU? (Only Volta has the Tensor Cores used in mixed precision.)
1) Yes. We use V100s; all the timings described above were measured on V100, both for the default Transformer-based model (https://github.com/tensorflow/tensor2tensor) and for OpenSeq2Seq (default configuration).
GPU utilization of 1% means that most of the work falls on the CPU. This happens because public TensorFlow + CUDA 9.0 does not have batched GEMM in float16 integrated. I would recommend you just use NVIDIA's TensorFlow container (18.07-py3), which you can get for free here: https://ngc.nvidia.com/registry/nvidia-tensorflow . It contains cuBLAS, CUDA, and cuDNN plus a TF version tested to work nicely together, and occasionally some GPU improvements which aren't in TF upstream yet. This way you don't need to worry about details like the above.
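For reference, pulling and launching such a container typically looks like the sketch below. This assumes the standard nvcr.io registry path for NGC images and the nvidia-docker 2.x runtime that was current at the time; adjust the tag if you need a different release.

```shell
# Log in with the NGC API key obtained after registering at ngc.nvidia.com
docker login nvcr.io

# Pull the 18.07-py3 TensorFlow image mentioned above
docker pull nvcr.io/nvidia/tensorflow:18.07-py3

# Launch an interactive session with the GPUs visible to the container
nvidia-docker run -it --rm nvcr.io/nvidia/tensorflow:18.07-py3
```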
I am not able to get speedups using mixed precision. After upgrading CUDA from 9.0 to 9.2, we have another problem. The logs show:

[[Node: MatMul_1 = BatchMatMul[T=DT_HALF, adj_x=false, adj_y=false, _device="/device:GPU:0"](a_1, b_1)]]

So, what can we do?
Can you please try NVIDIA's TensorFlow container (18.07-py3)? You can get it here for free: https://ngc.nvidia.com/registry/nvidia-tensorflow . I am not sure why upstream TF still doesn't have batched GEMM in fp16 ...
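As a quick diagnostic before switching containers, it can be worth confirming which TensorFlow build is actually in use; a CPU-only pip wheel would explain ops silently falling back to the CPU. A minimal sketch, assuming a working TensorFlow install:

```python
import tensorflow as tf

# Report the TensorFlow version string and whether this particular build
# was compiled with CUDA support at all.
print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
```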
Thank you @okuchaiev. We cannot access this website. Is there any other way (like Google Drive) to access this container?
The website seems to be up and running for me: https://ngc.nvidia.com . NVIDIA's TF containers are available only from there; it requires registration, but it is quick and free.
We have fixed the problem of the FP16 matmul not running on the GPU, but we still see almost no speedup. Model: OpenSeq2Seq transformer_big.py. The speeds of FP32 and mixed precision are almost identical. Why does mixed mode not speed up the Transformer model?
@dingsiyu This is using NVIDIA's TF containers, 2 GPUs, OpenSeq2Seq from the master branch, and no Horovod. One thing I noticed is that my FP32 model reports around 0.424 s per step, while mixed precision is closer to 0.33 s (same as yours).
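The step times quoted above correspond to roughly a 1.3x speedup from mixed precision, which can be checked directly:

```python
# Reported time per training step (seconds), from the comment above.
fp32_step = 0.424
mixed_step = 0.33

# Speedup is simply the ratio of the two step times.
speedup = fp32_step / mixed_step
print(f"mixed-precision speedup: {speedup:.2f}x")  # ~1.28x
```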
@okuchaiev But I do not know whether our system configurations are the same, and I am not sure whether CUDA 9.2 affects the speed of FP32 mode. Could you test transformer_big.py on the newest NVIDIA TF container, which may have the same config as mine? Thanks a lot!
When I train the Transformer in mixed-precision mode, the training time is very long:
(1) transformer_big.py: two GPUs (V100), all parameters default, time per step: 13 s.
(2) tensor2tensor (big model): two GPUs (V100), all parameters default, time per step: 0.3 s. But as presented in the mixed-precision training paper (https://arxiv.org/pdf/1710.03740.pdf), training time should be shorter than in FP32 mode. So why?
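To quantify the gap being reported: at these step times, the OpenSeq2Seq mixed-precision run is more than forty times slower than the tensor2tensor baseline on the same hardware:

```python
# Reported time per step (seconds) for comparable big models on 2x V100.
openseq2seq_mixed = 13.0   # OpenSeq2Seq transformer_big.py, mixed precision
tensor2tensor_fp32 = 0.3   # tensor2tensor big model

# Factor by which the mixed-precision run is slower than the baseline.
slowdown = openseq2seq_mixed / tensor2tensor_fp32
print(f"slowdown factor: {slowdown:.0f}x")  # ~43x
```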