
[tests] use the correct n_gpu in TrainerIntegrationTest::test_train_and_eval_dataloaders for XPU #29307

Merged: 2 commits into huggingface:main on Mar 8, 2024
Conversation

@faaany (Contributor) commented on Feb 27, 2024

What does this PR do?

The test below fails on an Intel GPU machine with 8 cards:

___________________________________ TrainerIntegrationTest.test_train_and_eval_dataloaders ___________________________________

self = <tests.trainer.test_trainer.TrainerIntegrationTest testMethod=test_train_and_eval_dataloaders>

    def test_train_and_eval_dataloaders(self):
        n_gpu = max(1, backend_device_count(torch_device))
        trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16)
>       self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
E       AssertionError: 16 != 128

tests/trainer/test_trainer.py:1035: AssertionError

This is because n_gpu is 8, but the default _n_gpus in TrainingArguments is 1 (as can be seen here), so total_batch_size ends up as 16. The test passes on CUDA because CUDA supports DataParallel, which sets _n_gpus to a value larger than 1 (as can be seen here). On all devices other than CUDA and CPU, the test therefore fails.
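To make the mismatch concrete, here is a small sketch of the arithmetic; the variable names are illustrative, not the actual Trainer internals:

```python
# Illustration only: how the expected and actual total batch sizes diverge on XPU.
per_device_train_batch_size = 16
visible_devices = 8  # backend_device_count("xpu") on an 8-card Intel GPU machine

# The test assumes the dataloader batch is scaled by the visible device count:
expected_total = per_device_train_batch_size * max(1, visible_devices)  # 128

# But outside of CUDA DataParallel, TrainingArguments keeps its n_gpu default,
# so the dataloader is built with an unscaled batch size:
actual_total = per_device_train_batch_size * max(1, 1)  # 16
```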

To fix this, we add a conditional check to the test so that it also works on devices other than CUDA (a sketch of the idea is shown below). Please have a review: @muellerzr and @pacman100
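For context, a minimal sketch of the kind of conditional check described above; this is an illustration under the assumption that only CUDA scales the dataloader batch by the device count, and the exact change is the one in the PR diff:

```python
# Sketch only (see the PR diff for the merged change): only CUDA uses
# DataParallel in Trainer, so only there is the dataloader batch size
# multiplied by the number of visible devices.
if torch_device == "cuda":
    n_gpu = max(1, backend_device_count(torch_device))
else:
    n_gpu = 1
trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16)
self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
```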

@faaany marked this pull request as ready for review on February 27, 2024 04:06
@faaany (Contributor, Author) commented on Mar 4, 2024

Hi @ArthurZucker, thanks for reviewing my other PRs. Would you like to take care of this one as well? Like other tests, this one was written with only GPU in mind and fails on other devices such as XPU. Thanks a lot!

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment


Fine by me. cc @muellerzr, feel free to merge if this is alright with you as well.

@faaany (Contributor, Author) commented on Mar 8, 2024

@muellerzr, can we proceed with this merge? Thanks a lot!

@muellerzr (Contributor) left a comment


Good for me as well!

@muellerzr merged commit 3f6973d into huggingface:main on Mar 8, 2024
18 checks passed