Horovod GradientTape performance for Tensorflow #1177
Comments
This is not expected. @tgaddair, could you take a look?
@aj-prime, by the way, any reason you're not using NCCL?
@alsrgv No specific reason.
@aj-prime, NCCL will give you better performance on GPU compared to MPI. That said, I do see a slowdown of eager TF compared to regular TF even with NCCL in my environment.
@alsrgv I am trying to set up Horovod with NCCL. Is the slowdown as severe as with MV2-GDR?
@aj-prime, I have not tried MV2-GDR, so I'm not sure. What kind of performance are you seeing with the graph-mode TensorFlow MNIST example?
@alsrgv For the tensorflow_mnist.py script with batch size 32, I am getting the following numbers:
OK, that's much better. We'll look into the eager mode performance. |
Hey @aj-prime, can you try running tensorflow_synthetic_benchmark.py?
Hello @tgaddair, I ran tensorflow_synthetic_benchmark.py in graph and eager mode. Here are the results (Graph Mode vs. Eager Mode):
Those numbers look a lot better, though still not great (there are some significant performance penalties to eager execution in TensorFlow at present). One thing we do in the synthetic benchmark but not in the MNIST example is device placement: without device placement, allreduce happens on the CPU, which can slow things down considerably. Can you try adding device placement? It could also simply be due to the fact that ResNet50 is a more complex model, so more of the time is spent in computation vs. communication. I'll see if I can repro on our end.
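A minimal sketch of what explicit device placement could look like in an eager training step, assuming TF 2.x-style APIs and one GPU per Horovod process; the model, loss, and optimizer here are illustrative stand-ins, not the exact MNIST example:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin this process to a single GPU (as the synthetic benchmark does), so the
# forward/backward pass and the allreduce both run on the GPU instead of
# silently falling back to the CPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())

images = tf.random.uniform((32, 28, 28))
labels = tf.random.uniform((32,), maxval=10, dtype=tf.int64)

# One training step with explicit device placement.
with tf.device('/gpu:0' if gpus else '/cpu:0'):
    with tf.GradientTape() as tape:
        loss_value = loss_fn(labels, model(images, training=True))
    # DistributedGradientTape makes tape.gradient() allreduce the gradients.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss_value, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))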
@aj-prime, we figured out the cause of that slowdown. There is another issue in eager mode, though. In graph mode, TensorFlow can start reducing gradients for layers close to the loss while the rest of the gradients are still being computed; this ensures proper ordering of allreduce operations. In eager mode, allreduce starts only after all the gradients are computed, which causes an additional delay and randomizes the ordering of gradient reductions. Because of that, it's recommended to wrap the whole training step in @tf.function:
@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        logits = mnist_model(images, training=True)
        loss_value = loss(labels, logits)

    # Horovod: add Horovod Distributed GradientTape.
    tape = hvd.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    if first_batch:
        hvd.broadcast_variables(mnist_model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value
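As a usage note (not part of the original comment), a rough sketch of how such a step might be driven; the dataset below is a synthetic stand-in, and mnist_model, loss, and opt are assumed to be defined as in the step above:

import tensorflow as tf
import horovod.tensorflow as hvd

# Synthetic stand-in dataset so the loop below is runnable; the real example
# iterates over MNIST instead.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((1024, 28, 28)),
     tf.random.uniform((1024,), maxval=10, dtype=tf.int64))
).batch(32)

for batch_idx, (images, labels) in enumerate(dataset):
    # Passing first_batch=True on the very first call broadcasts the initial
    # variables from rank 0 so all workers start from the same state.
    loss_value = training_step(images, labels, batch_idx == 0)
    if batch_idx % 10 == 0 and hvd.rank() == 0:
        print('Step %d\tLoss: %.6f' % (batch_idx, loss_value))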
Environment:
Your question:
I have modified the tensorflow_mnist_eager.py script from the examples to print images per second (a rough sketch of this kind of measurement appears after the question below).
Batch size: 32
#GPUs    Perf (images/sec)
1        2216
2        425
4        640
It looks like there is a significant initial overhead to distributing the DNN training.
Is this the expected behavior?
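A hypothetical sketch of this kind of images/sec measurement; training_step, dataset, and the batch size of 32 are assumptions based on the discussion above, not the actual modified script:

import time

import horovod.tensorflow as hvd

# Time a fixed number of batches after a short warm-up and report throughput
# on rank 0. training_step and dataset are assumed to exist as in the example.
batch_size = 32
warmup_batches = 10
timed_batches = 100

for idx, (images, labels) in enumerate(dataset.take(warmup_batches + timed_batches)):
    if idx == warmup_batches:
        start = time.time()  # start the clock only after the warm-up batches
    training_step(images, labels, idx == 0)

elapsed = time.time() - start
if hvd.rank() == 0:
    print('Images/sec per GPU: %.1f' % (timed_batches * batch_size / elapsed))
    print('Total images/sec:   %.1f' % (timed_batches * batch_size * hvd.size() / elapsed))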