Add AWS Neuron torchrun support #20806
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?
@jeffhataws could you please explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose users would still need to modify the scripts, e.g.:

```python
import os

import torch
import transformers
from packaging import version
from transformers import Trainer, __version__

# Fixup to enable distributed training with XLA
if version.parse(__version__) < version.parse("4.20.0"):
    Trainer._wrap_model = lambda self, model, training=True: model
else:
    Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

# Workaround for NaNs seen with transformers version >= 4.21.0
# https://github.com/aws-neuron/aws-neuron-sdk/issues/593
if os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16"):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16
```
Yes, for this test we will need a Trainium instance. Over time, once pytorch/xla#3609 is released, we can make it more generic for GPU/XLA. For now, the Neuron team will test this. The test is currently passing on a Trainium instance.
The first workaround is for missing DDP support, which will be available in Neuron's PyTorch-XLA version 1.13 (a future release). The second workaround is already fixed in transformers==4.25.1 by #20562.
Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?
@sgugger since we already have a workaround for the DDP wrapper by overwriting the _wrap_model function, we can actually merge this first. The reasons are that 1) we want it in for the next transformers release, ahead of 1.13, and 2) I will need this change to post another PR for the default compiler flag for the transformer model type. Let me know if this is acceptable.
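For illustration only, here is a minimal sketch of what a default Neuron compiler flag for the transformer model type could look like. The NEURON_CC_FLAGS environment variable and the --model-type=transformer value are taken from the AWS Neuron SDK documentation; how the follow-up PR actually applies them is an assumption, not something shown in this conversation.

```python
import os

# Hypothetical sketch: default the Neuron compiler to the transformer model type
# unless the user already set a model type. The env var and flag value follow the
# AWS Neuron SDK docs; the wiring here is an assumption, not the follow-up PR's code.
flags = os.environ.get("NEURON_CC_FLAGS", "")
if "--model-type" not in flags:
    os.environ["NEURON_CC_FLAGS"] = f"{flags} --model-type=transformer".strip()
```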
Thanks for your patience on this.
* Add XLA torchrun support
* Clarify that currently DDP doesn't work with the torch.distributed XLA backend yet
* Enable DDP with torchrun and XLA (now available in PT-XLA 1.13)
* Add check for AWS Neuron availability and AWS Neuron specific compiler flag
* Change the new test's name to TestTrainerDistributedNeuronCore
* Remove "assert" and replace raised exception
* Remove compiler flag as it is optional. If needed, it will be another PR.
* Use TORCHELASTIC_RUN_ID to determine whether torchrun is used
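As a rough illustration of the last two bullets above, here is a minimal sketch of detecting a torchrun launch via the TORCHELASTIC_RUN_ID environment variable and initializing the XLA process group. The helper name is hypothetical and this is not the exact code added by the PR; it assumes PyTorch/XLA >= 1.13.

```python
import os

import torch.distributed as dist


def maybe_init_xla_distributed() -> None:
    # torchrun (torchelastic) sets TORCHELASTIC_RUN_ID for every worker it spawns,
    # so its presence is a reasonable signal that the script was launched via torchrun.
    if os.environ.get("TORCHELASTIC_RUN_ID") is not None:
        # Importing this module registers the "xla" torch.distributed backend
        # (available in PyTorch/XLA 1.13).
        import torch_xla.distributed.xla_backend  # noqa: F401

        dist.init_process_group(backend="xla")
```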
What does this PR do?
This PR adds torchrun support for the AWS Neuron SDK.
The existing HF tutorial for the Neuron SDK requires users to modify the HF example scripts (e.g., run_glue.py). This change helps minimize the modifications required.
This change requires the upcoming AWS Neuron PyTorch 1.13 release.
This is an update to #19907.
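As an illustration of the availability check mentioned in the commit list above, here is a minimal sketch. The helper name and the torch_neuronx module probe are assumptions based on how AWS Neuron packages its PyTorch/XLA support, not necessarily the exact check this PR adds.

```python
import importlib.util


def is_torch_neuroncore_available() -> bool:
    # Sketch: treat the presence of the torch_neuronx package as the signal that
    # the AWS Neuron PyTorch/XLA stack is installed. The function name and the
    # probed module are assumptions for illustration.
    return importlib.util.find_spec("torch_neuronx") is not None
```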
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger