
PyTorch Synthetic Benchmark #545

Merged
merged 2 commits into from Oct 8, 2018

Conversation

alsrgv
Member

@alsrgv alsrgv commented Oct 6, 2018

No description provided.

@alsrgv alsrgv self-assigned this Oct 6, 2018
@alsrgv alsrgv requested a review from tgaddair October 6, 2018 20:41
# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
# TODO: needs bugfix
#hvd.broadcast_optimizer_state(optimizer, root_rank=0)
Member Author

@tgaddair, we should fix this bug before landing. optim.SGD without momentum and weight decay causes the following issue with broadcast_optimizer_state:

Traceback (most recent call last):
  File "pytorch_synthetic_benchmark.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "/usr/local/lib/python2.7/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 4592867280
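The failure mode can be illustrated with a minimal sketch (this is not Horovod's actual code, and `broadcast_optimizer_state_sketch` is a hypothetical stand-in): with plain SGD there are no per-parameter state entries such as momentum buffers, so the optimizer's `state_dict()['state']` is empty, and indexing it by a parameter id raises the KeyError seen above.

```python
# Hypothetical sketch of the failing lookup. With SGD and no momentum or
# weight decay, the optimizer has created no per-parameter state, so
# state_dict['state'] is an empty dict and indexing by parameter id fails.

def broadcast_optimizer_state_sketch(state_dict, param_ids):
    broadcasted = []
    for pid in param_ids:
        param_state = state_dict['state'][pid]  # KeyError when state is empty
        broadcasted.append((pid, param_state))
    return broadcasted

# An empty optimizer state, as plain SGD produces before any state is created.
empty_state = {'state': {}, 'param_groups': [{'params': [4592867280]}]}
try:
    broadcast_optimizer_state_sketch(empty_state, [4592867280])
except KeyError as e:
    print('KeyError:', e)  # mirrors the traceback above
```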

Collaborator

Done #548.


@alsrgv alsrgv merged commit 983a06e into master Oct 8, 2018
@alsrgv alsrgv deleted the pytorch_benchmark branch October 8, 2018 16:57
@bapriddy
Contributor

@alsrgv I'm getting the following error when running the pytorch_synthetic_benchmark.py

Traceback (most recent call last):
  File "syn.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 140230636548384

I get the error in parallel as well, with the same output repeated k times. For example:

aprun -n 4 -N 1 ~/miniconda3/bin/python syn.py

Gives the same traceback from each of the four ranks, interleaved in the output:

Traceback (most recent call last):
  File "syn.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 46913509835904

I've included the referenced call below, from horovod/torch/__init__.py:

def _create_callback(pid, name, t, p):
    def _from_tensor():
        state_dict['state'][pid][name] = t(p.numpy()[0])
    return _from_tensor

@alsrgv
Member Author

alsrgv commented Oct 19, 2018

@bapriddy, can you comment out this line or install Horovod from master?

@bapriddy
Contributor

Yes. I'll check it. Thanks.

@bapriddy
Contributor

It worked. Thanks!!

@bapriddy
Contributor

@alsrgv Awesome!!! So nice to have this. Again, Thanks!!!

@bapriddy
Contributor

@alsrgv How does the code decide when to stop the "warmup" and proceed with the test? Just curious. Also, would switching to fp16 have any effect?

@alsrgv
Member Author

alsrgv commented Oct 22, 2018

@bapriddy, warmup runs for --num-warmup-batches, which defaults to 10. Using --fp16-allreduce should improve performance if your network is slow.
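The warmup/timing split described above can be sketched as follows. This is a simplified stand-in, not the benchmark script itself: `run_benchmark` and `step` are hypothetical names, the defaults mirror the `--num-warmup-batches` flag mentioned in the comment, and the real script also averages across iterations and ranks.

```python
import time

def run_benchmark(step, num_warmup_batches=10, num_iters=10,
                  num_batches_per_iter=10):
    # Warmup: run untimed batches so one-off startup costs
    # (CUDA context creation, allocator warmup) do not skew the timing.
    for _ in range(num_warmup_batches):
        step()
    # Timed phase: measure batches/sec over several timing windows.
    rates = []
    for _ in range(num_iters):
        start = time.perf_counter()
        for _ in range(num_batches_per_iter):
            step()
        rates.append(num_batches_per_iter / (time.perf_counter() - start))
    return sum(rates) / len(rates)

calls = []
avg_batches_per_sec = run_benchmark(lambda: calls.append(1),
                                    num_warmup_batches=2,
                                    num_iters=3, num_batches_per_iter=4)
print(len(calls))  # 2 warmup + 3 * 4 timed = 14 calls
```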

@bapriddy
Contributor

@alsrgv Is it possible to modify pytorch_synthetic_benchmark.py for resnet18, resnet101, or other ImageNet models? I did this with pytorch_imagenet_resnet50.py by changing line 114.

# Set up standard ResNet-50 model.
model = models.resnet50()

@alsrgv
Member Author

alsrgv commented Oct 22, 2018

@bapriddy, yeah, you can just pass --model resnet101.
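A rough sketch of how a `--model` flag like this resolves a name to a torchvision model constructor (the exact argument handling in the script may differ; the `getattr` lookup is shown as a comment to keep the sketch torchvision-free):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default='resnet50',
                    help='model to benchmark, by torchvision.models name')
args = parser.parse_args(['--model', 'resnet101'])

# The model class is then looked up by name, roughly:
#   model = getattr(torchvision.models, args.model)()
print(args.model)  # resnet101
```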

@bapriddy
Contributor

@alsrgv Got it!

jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019

3 participants