[distributed] Some test_fsdp_core.py cases got random failures in _join_processes(fn) #1475

@daisyden

🐛 Describe the bug

While running the pre-CI tests on the branch daisyden/fsdp_test, I found that some cases in test_fsdp_core.py fail randomly, for example:
test_transformer_no_grad_mixed_precision_True_xpu
test_transformer_no_grad_mixed_precision_False_xpu

2025-03-13T07:14:07.6789182Z =================================== FAILURES ===================================
2025-03-13T07:14:07.6789816Z _______ TestNoGradXPU.test_transformer_no_grad_mixed_precision_False_xpu _______
2025-03-13T07:14:07.6790144Z Traceback (most recent call last):
2025-03-13T07:14:07.6790791Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 605, in wrapper
2025-03-13T07:14:07.6791247Z     self._join_processes(fn)
2025-03-13T07:14:07.6791695Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 845, in _join_processes
2025-03-13T07:14:07.6792170Z     self._check_return_codes(elapsed_time)
2025-03-13T07:14:07.6792640Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 902, in _check_return_codes
2025-03-13T07:14:07.6793183Z     self.assertEqual(
2025-03-13T07:14:07.6793588Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4094, in assertEqual
2025-03-13T07:14:07.6794079Z     raise error_metas.pop()[0].to_error(  # type: ignore[index]
2025-03-13T07:14:07.6794347Z AssertionError: Scalars are not equal!
2025-03-13T07:14:07.6794486Z 
2025-03-13T07:14:07.6794570Z Expected 0 but got -11.
2025-03-13T07:14:07.6794758Z Absolute difference: 11
2025-03-13T07:14:07.6794932Z Relative difference: inf
2025-03-13T07:14:07.6795197Z Expect process 1 exit code to match Process 0 exit code of 0, but got -11
2025-03-13T07:14:07.6795725Z - generated xml file: /home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/distributed/fsdp/test_fsdp_core.py.xml -
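For readers of the log above: Python's `subprocess` and `multiprocessing` modules report a child that was terminated by signal N as exit code -N, so the `-11` for process 1 indicates it was killed by signal 11, which is SIGSEGV on Linux. The helper below is a small sketch for decoding such codes; `describe_exit_code` is a hypothetical name for illustration and is not part of PyTorch.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return a human-readable description of a child process exit code.

    Hypothetical helper for illustration; not part of torch.testing.
    """
    if code == 0:
        return "exited normally"
    if code < 0:
        # subprocess / multiprocessing report death-by-signal N as -N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-11))
```

On Linux this should report signal 11 as SIGSEGV, which suggests the rank 1 worker crashed in native code rather than failing a Python-level assertion.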

 576 class MultiProcessTestCase(TestCase):
 577     MAIN_PROCESS_RANK = -1
 578     # This exit code is used to indicate that the test code had an error and
 579     # exited abnormally. There are certain tests that might use sys.exit() to
 580     # simulate failures and in those cases, we can't have an exit code of 0,
 581     # but we still want to ensure we didn't run into any other errors.
 582     TEST_ERROR_EXIT_CODE = 10
 583
 584     # do not early terminate for distributed tests.
 585     def _should_stop_test_suite(self) -> bool:
 586         return False
 587
 588     # Many test cases init a process group but do not destroy it.  This property
 589     # determines whether this base test class should call
 590     # `destroy_process_group` on behalf of the test. Its value is customizable
 591     # by derived TestCase's but it is a pan-TestCase value (cannot be customized
 592     # for each test).
 593     @property
 594     def destroy_pg_upon_exit(self) -> bool:
 595         return True
 596
 597     @property
 598     def world_size(self) -> int:
 599         return DEFAULT_WORLD_SIZE
 600
 601     def join_or_run(self, fn):
 602         @wraps(fn)
 603         def wrapper(self):
 604             if self.rank == self.MAIN_PROCESS_RANK:
 605                 self._join_processes(fn)  # <-- frame shown in the traceback above
 606             else:
 607                 fn()
 608
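The listing above shows the main process joining the worker processes; the traceback shows the subsequent check that every worker's exit code matches that of process 0. The following is a minimal stand-alone sketch of that join-and-compare pattern, assuming a POSIX system. It uses `subprocess` rather than the actual `MultiProcessTestCase` machinery, and `run_ranks` and `check_return_codes` are hypothetical names chosen for this example. One worker deliberately sends itself SIGSEGV to reproduce the `-11` exit code.

```python
import subprocess
import sys

# Snippet that makes a child interpreter kill itself with SIGSEGV (POSIX only).
CRASH = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_ranks(snippets):
    """Start one Python child per snippet, wait for all, return exit codes."""
    procs = [subprocess.Popen([sys.executable, "-c", s]) for s in snippets]
    return [p.wait() for p in procs]


def check_return_codes(codes):
    """Raise AssertionError if any exit code differs from that of process 0."""
    first = codes[0]
    for rank, code in enumerate(codes):
        if code != first:
            raise AssertionError(
                f"Expect process {rank} exit code to match "
                f"Process 0 exit code of {first}, but got {code}"
            )


if __name__ == "__main__":
    codes = run_ranks(["pass", CRASH])  # rank 0 succeeds, rank 1 crashes
    try:
        check_return_codes(codes)
    except AssertionError as exc:
        print(exc)
```

Because the crash happens inside the child, the parent only sees the exit code. The worker's own stderr or a core dump is usually needed to locate the faulting call.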

Versions

Torch c208f217917929a9f780a81a8c7f788b4c03ee05

Platform:

Data Center GPU Max 1100 OpenCL 3.0 NEO [25.05.32567]

libigc2 2.7.11-1099~22.04

         GPU 0/0  GPU 1/0  GPU 2/0  GPU 3/0  GPU 4/0  GPU 5/0  GPU 6/0  GPU 7/0  CPU Affinity
GPU 0/0  S        XL8      XL8      XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 1/0  XL8      S        XL8      XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 2/0  XL8      XL8      S        XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 3/0  XL8      XL8      XL8      S        SYS      SYS      SYS      SYS      0-47,96-143
GPU 4/0  SYS      SYS      SYS      SYS      S        XL8      XL8      XL8      48-95,144-191
GPU 5/0  SYS      SYS      SYS      SYS      XL8      S        XL8      XL8      48-95,144-191
GPU 6/0  SYS      SYS      SYS      SYS      XL8      XL8      S        XL8      48-95,144-191
GPU 7/0  SYS      SYS      SYS      SYS      XL8      XL8      XL8      S        48-95,144-191

ZE_AFFINITY_MASK=0,1,2,3

Labels

bug (Something isn't working)
module: distributed (For distributed feature issue)
