
[tests] use the correct n_gpu in TrainerIntegrationTest::test_train_and_eval_dataloaders for XPU #29307

Merged: 2 commits into huggingface:main on Mar 8, 2024
Conversation

@faaany (Contributor) commented on Feb 27, 2024

What does this PR do?

The test below fails on an Intel GPU machine with 8 cards:

___________________________________ TrainerIntegrationTest.test_train_and_eval_dataloaders ___________________________________

self = <tests.trainer.test_trainer.TrainerIntegrationTest testMethod=test_train_and_eval_dataloaders>

    def test_train_and_eval_dataloaders(self):
        n_gpu = max(1, backend_device_count(torch_device))
        trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16)
>       self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
E       AssertionError: 16 != 128

tests/trainer/test_trainer.py:1035: AssertionError

This is because n_gpu is 8, but the default _n_gpus in TrainingArguments is 1 (as can be seen here), so total_batch_size ends up as 16. The test passes on CUDA because CUDA supports DataParallel, which sets _n_gpus to a value larger than 1 (as can be seen here). On all devices other than CUDA and CPU, the test therefore fails.
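To make the mismatch concrete, here is a small sketch of the arithmetic; the variable names are illustrative, not the actual Trainer internals:

```python
# Illustration only: how the expected and actual total batch sizes diverge on XPU.
per_device_train_batch_size = 16
visible_devices = 8  # backend_device_count("xpu") on an 8-card Intel GPU machine

# The test assumes the dataloader batch is scaled by the visible device count:
expected_total = per_device_train_batch_size * max(1, visible_devices)  # 128

# But outside of CUDA DataParallel, TrainingArguments keeps its n_gpu default,
# so the dataloader is built with an unscaled batch size:
actual_total = per_device_train_batch_size * max(1, 1)  # 16
```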

To fix this, we add a conditional check to the test so that it also works on devices other than CUDA (a sketch of the idea is shown below). Please have a review: @muellerzr and @pacman100
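For context, a minimal sketch of the kind of conditional check described above; this is an illustration under the assumption that only CUDA scales the dataloader batch by the device count, and the exact change is the one in the PR diff:

```python
# Sketch only (see the PR diff for the merged change): only CUDA uses
# DataParallel in Trainer, so only there is the dataloader batch size
# multiplied by the number of visible devices.
if torch_device == "cuda":
    n_gpu = max(1, backend_device_count(torch_device))
else:
    n_gpu = 1
trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16)
self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
```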

@faaany marked this pull request as ready for review on February 27, 2024 04:06
@faaany (Contributor, Author) commented on Mar 4, 2024

Hi @ArthurZucker, thanks for reviewing my other PRs. Would you like to take care of this one as well? Like other tests, this one was written with only GPU in mind and fails on other devices such as XPU. Thanks a lot!

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment


Fine by me. cc @muellerzr, feel free to merge if this is alright with you as well.

@faaany (Contributor, Author) commented on Mar 8, 2024

@muellerzr, can we proceed with this merge? Thanks a lot!

@muellerzr (Contributor) left a comment


Good for me as well!

@muellerzr merged commit 3f6973d into huggingface:main on Mar 8, 2024
18 checks passed