
[tests] switch to torchrun #22712

Merged
merged 1 commit into main from torchrun on Apr 12, 2023
Conversation

stas00 (Contributor) commented Apr 11, 2023

This PR fixes the following errors in the nightly CI tests:

FAILED tests/extended/test_trainer_ext.py::TestTrainerExt::test_run_seq2seq_apex
FAILED tests/extended/test_trainer_ext.py::TestTrainerExt::test_run_seq2seq_ddp
FAILED tests/extended/test_trainer_ext.py::TestTrainerExt::test_trainer_log_level_replica_0_base
FAILED tests/extended/test_trainer_ext.py::TestTrainerExt::test_trainer_log_level_replica_1_low
FAILED tests/extended/test_trainer_ext.py::TestTrainerExt::test_trainer_log_level_replica_2_high
FAILED tests/extended/test_trainer_ext.py::TestTrainerExt::test_trainer_log_level_replica_3_mixed

by switching from the deprecated `torch.distributed.launch` to `torch.distributed.run` (`torchrun`). The old launcher was failing with:

stderr:   File "/workspace/transformers/examples/pytorch/translation/run_translation.py", line 664, in <module>
stderr:     main()
stderr:   File "/workspace/transformers/examples/pytorch/translation/run_translation.py", line 262, in main
stderr:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
stderr:   File "/workspace/transformers/src/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
stderr:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
stderr: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=1']
stderr: /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
stderr: and will be removed in future. Use torchrun.
stderr: Note that --use-env is set by default in torchrun.
stderr: If your script expects `--local-rank` argument to be set, please
stderr: change it to read from `os.environ['LOCAL_RANK']` instead. See
stderr: https://pytorch.org/docs/stable/distributed.html#launch-utility for
stderr: further instructions
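For context, the switch amounts to changing the launcher invocation. A sketch (script path taken from the traceback above; `--nproc_per_node` value and trailing arguments are illustrative):

```shell
# Deprecated launcher: passes --local-rank=N as a CLI argument to the script,
# which HfArgumentParser does not recognize (hence the ValueError above).
python -m torch.distributed.launch --nproc_per_node=2 \
    examples/pytorch/translation/run_translation.py ...

# torchrun: communicates the rank via environment variables (LOCAL_RANK, RANK,
# WORLD_SIZE, ...) instead, so no unexpected CLI argument reaches the script.
torchrun --nproc_per_node=2 \
    examples/pytorch/translation/run_translation.py ...
```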

While at it, I also updated tests/trainer/test_trainer_distributed.py.
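The FutureWarning in the log spells out the script-side change torchrun expects. A minimal sketch of that pattern (the helper function name is hypothetical, not from this PR):

```python
import os

def get_local_rank() -> int:
    """Read the local rank the way torchrun communicates it: via the
    LOCAL_RANK environment variable, rather than a --local-rank CLI flag.
    Falls back to 0 (single-process run) when the variable is unset."""
    return int(os.environ.get("LOCAL_RANK", 0))
```

Under `torchrun --nproc_per_node=2 script.py`, each worker process sees its own `LOCAL_RANK` (0 or 1); a plain `python script.py` run sees no variable and falls back to 0.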

@stas00 stas00 requested a review from ydshieh April 11, 2023 22:16
@stas00 stas00 marked this pull request as ready for review April 11, 2023 22:16
HuggingFaceDocBuilderDev commented Apr 11, 2023

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator) left a comment:

It works, thank you @stas00 🚀
But let me push commits for the GLIBCXX_3.4.30 issue first, then I will merge.

ydshieh (Collaborator) commented Apr 12, 2023

Unfortunately, conda refuses to install libstdcxx-ng=12 and, after 20 or more minutes of examination, produces an extremely long report of conflicting packages.

I think we can merge this PR first, and I will keep trying to find a way to get GLIBCXX_3.4.30 installed.
@stas00 Does this work for you?

stas00 (Contributor, Author) commented Apr 12, 2023

The GLIBCXX_3.4.30 issue is totally unrelated to this PR, so let's deal with it separately.

@stas00 stas00 merged commit 1306b7d into main Apr 12, 2023
@stas00 stas00 deleted the torchrun branch April 12, 2023 15:25
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023