
PyTorch Synthetic Benchmark #545

Merged
merged 2 commits into from Oct 8, 2018

Conversation

alsrgv
Member

@alsrgv alsrgv commented Oct 6, 2018

No description provided.

@alsrgv alsrgv self-assigned this Oct 6, 2018
@alsrgv alsrgv requested a review from tgaddair October 6, 2018 20:41
# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
# TODO: needs bugfix
#hvd.broadcast_optimizer_state(optimizer, root_rank=0)
Member Author

@tgaddair, we should fix this bug before landing. optim.SGD without momentum and weight decay causes the following issue with broadcast_optimizer_state:

Traceback (most recent call last):
  File "pytorch_synthetic_benchmark.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "/usr/local/lib/python2.7/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 4592867280
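The failure mode can be illustrated with a minimal sketch (this is not Horovod's actual code, and `broadcast_optimizer_state_sketch` is a hypothetical stand-in): with plain SGD there are no per-parameter state entries such as momentum buffers, so the optimizer's `state_dict()['state']` is empty, and indexing it by a parameter id raises the KeyError seen above.

```python
# Hypothetical sketch of the failing lookup. With SGD and no momentum or
# weight decay, the optimizer has created no per-parameter state, so
# state_dict['state'] is an empty dict and indexing by parameter id fails.

def broadcast_optimizer_state_sketch(state_dict, param_ids):
    broadcasted = []
    for pid in param_ids:
        param_state = state_dict['state'][pid]  # KeyError when state is empty
        broadcasted.append((pid, param_state))
    return broadcasted

# An empty optimizer state, as plain SGD produces before any state is created.
empty_state = {'state': {}, 'param_groups': [{'params': [4592867280]}]}
try:
    broadcast_optimizer_state_sketch(empty_state, [4592867280])
except KeyError as e:
    print('KeyError:', e)  # mirrors the traceback above
```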

Collaborator

Done #548.


@alsrgv alsrgv merged commit 983a06e into master Oct 8, 2018
@alsrgv alsrgv deleted the pytorch_benchmark branch October 8, 2018 16:57
@bapriddy
Contributor

@alsrgv I'm getting the following error when running the pytorch_synthetic_benchmark.py

Traceback (most recent call last):
  File "syn.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 140230636548384

I get the error in parallel as well, with the same output repeated k times. For example:

aprun -n 4 -N 1 ~/miniconda3/bin/python syn.py

Gives the same traceback from each of the four ranks, interleaved in the output:

Traceback (most recent call last):
  File "syn.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 46913509835904

I've included the referenced call below, from horovod/torch/__init__.py:

def _create_callback(pid, name, t, p):
    def _from_tensor():
        state_dict['state'][pid][name] = t(p.numpy()[0])
    return _from_tensor

@alsrgv
Member Author

alsrgv commented Oct 19, 2018

@bapriddy, can you comment out this line or install Horovod from master?

@bapriddy
Contributor

Yes. I'll check it. Thanks.

@bapriddy
Contributor

It worked. Thanks!!

@bapriddy
Contributor

@alsrgv Awesome!!! So nice to have this. Again, Thanks!!!

@bapriddy
Contributor

@alsrgv How does the code decide when to stop the "warmup" and proceed with the test? Just curious. Also, would switching to fp16 have any effect?

@alsrgv
Member Author

alsrgv commented Oct 22, 2018

@bapriddy, warmup runs for --num-warmup-batches, which defaults to 10. Using --fp16-allreduce should improve performance if your network is slow.
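The warmup/timing split described above can be sketched as follows. This is a simplified stand-in, not the benchmark script itself: `run_benchmark` and `step` are hypothetical names, the defaults mirror the `--num-warmup-batches` flag mentioned in the comment, and the real script also averages across iterations and ranks.

```python
import time

def run_benchmark(step, num_warmup_batches=10, num_iters=10,
                  num_batches_per_iter=10):
    # Warmup: run untimed batches so one-off startup costs
    # (CUDA context creation, allocator warmup) do not skew the timing.
    for _ in range(num_warmup_batches):
        step()
    # Timed phase: measure batches/sec over several timing windows.
    rates = []
    for _ in range(num_iters):
        start = time.perf_counter()
        for _ in range(num_batches_per_iter):
            step()
        rates.append(num_batches_per_iter / (time.perf_counter() - start))
    return sum(rates) / len(rates)

calls = []
avg_batches_per_sec = run_benchmark(lambda: calls.append(1),
                                    num_warmup_batches=2,
                                    num_iters=3, num_batches_per_iter=4)
print(len(calls))  # 2 warmup + 3 * 4 timed = 14 calls
```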

@bapriddy
Contributor

@alsrgv Is it possible to modify pytorch_synthetic_benchmark.py for resnet18, resnet101, or other ImageNet models? I did this with pytorch_imagenet_resnet50.py by changing line 114.

# Set up standard ResNet-50 model.
model = models.resnet50()

@alsrgv
Member Author

alsrgv commented Oct 22, 2018

@bapriddy, yeah, you can just pass --model resnet101.
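A rough sketch of how a `--model` flag like this resolves a name to a torchvision model constructor (the exact argument handling in the script may differ; the `getattr` lookup is shown as a comment to keep the sketch torchvision-free):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default='resnet50',
                    help='model to benchmark, by torchvision.models name')
args = parser.parse_args(['--model', 'resnet101'])

# The model class is then looked up by name, roughly:
#   model = getattr(torchvision.models, args.model)()
print(args.model)  # resnet101
```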

@bapriddy
Contributor

@alsrgv Got it!

jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019

3 participants