Description
🐛 Describe the bug
While running the preci test for the branch daisyden/fsdp_test, I found that some cases in test_fsdp_core.py fail randomly, for example:
test_transformer_no_grad_mixed_precision_True_xpu
test_transformer_no_grad_mixed_precision_False_xpu
2025-03-13T07:14:07.6789182Z =================================== FAILURES ===================================
2025-03-13T07:14:07.6789816Z _______ TestNoGradXPU.test_transformer_no_grad_mixed_precision_False_xpu _______
2025-03-13T07:14:07.6790144Z Traceback (most recent call last):
2025-03-13T07:14:07.6790791Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 605, in wrapper
2025-03-13T07:14:07.6791247Z self._join_processes(fn)
2025-03-13T07:14:07.6791695Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 845, in _join_processes
2025-03-13T07:14:07.6792170Z self._check_return_codes(elapsed_time)
2025-03-13T07:14:07.6792640Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 902, in _check_return_codes
2025-03-13T07:14:07.6793183Z self.assertEqual(
2025-03-13T07:14:07.6793588Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4094, in assertEqual
2025-03-13T07:14:07.6794079Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-03-13T07:14:07.6794347Z AssertionError: Scalars are not equal!
2025-03-13T07:14:07.6794486Z
2025-03-13T07:14:07.6794570Z Expected 0 but got -11.
2025-03-13T07:14:07.6794758Z Absolute difference: 11
2025-03-13T07:14:07.6794932Z Relative difference: inf
2025-03-13T07:14:07.6795197Z Expect process 1 exit code to match Process 0 exit code of 0, but got -11
2025-03-13T07:14:07.6795725Z - generated xml file: /home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/distributed/fsdp/test_fsdp_core.py.xml -
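The `-11` in the assertion is the exit code of the rank 1 subprocess. Python's `multiprocessing` and `subprocess` modules report a negative exit code `-N` when a child is terminated by signal `N`, and signal 11 on Linux is `SIGSEGV`, so rank 1 most likely crashed with a segmentation fault rather than failing a Python-level check. A minimal sketch for decoding such codes (the helper name `describe_exit_code` is made up for illustration and is not part of PyTorch):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return a human-readable description of a child process exit code."""
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with status {code}"
    # Negative codes mean the child was terminated by signal number -code.
    try:
        name = signal.Signals(-code).name
    except ValueError:
        # Signal number unknown on this platform; fall back to the raw value.
        name = f"signal {-code}"
    return f"killed by {name}"


print(describe_exit_code(-11))
```

On Linux this reports `SIGSEGV` for `-11`, which suggests looking for a native crash in rank 1 (for example in the XPU runtime or collective backend) instead of a test logic error.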
For reference, the frame at line 605 of common_distributed.py is inside MultiProcessTestCase.join_or_run:
576 class MultiProcessTestCase(TestCase):
577     MAIN_PROCESS_RANK = -1
578     # This exit code is used to indicate that the test code had an error and
579     # exited abnormally. There are certain tests that might use sys.exit() to
580     # simulate failures and in those cases, we can't have an exit code of 0,
581     # but we still want to ensure we didn't run into any other errors.
582     TEST_ERROR_EXIT_CODE = 10
583
584     # do not early terminate for distributed tests.
585     def _should_stop_test_suite(self) -> bool:
586         return False
587
588     # Many test cases init a process group but do not destroy it. This property
589     # determines whether this base test class should call
590     # `destroy_process_group` on behalf of the test. Its value is customizable
591     # by derived TestCase's but it is a pan-TestCase value (cannot be customized
592     # for each test).
593     @property
594     def destroy_pg_upon_exit(self) -> bool:
595         return True
596
597     @property
598     def world_size(self) -> int:
599         return DEFAULT_WORLD_SIZE
600
601     def join_or_run(self, fn):
602         @wraps(fn)
603         def wrapper(self):
604             if self.rank == self.MAIN_PROCESS_RANK:
605                 self._join_processes(fn)  # <-- frame shown in the traceback
606             else:
607                 fn()
608
Versions
Torch c208f217917929a9f780a81a8c7f788b4c03ee05
Platform:
Data Center GPU Max 1100 OpenCL 3.0 NEO [25.05.32567]
libigc2 2.7.11-1099~22.04
         GPU 0/0  GPU 1/0  GPU 2/0  GPU 3/0  GPU 4/0  GPU 5/0  GPU 6/0  GPU 7/0  CPU Affinity
GPU 0/0  S        XL8      XL8      XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 1/0  XL8      S        XL8      XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 2/0  XL8      XL8      S        XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 3/0  XL8      XL8      XL8      S        SYS      SYS      SYS      SYS      0-47,96-143
GPU 4/0  SYS      SYS      SYS      SYS      S        XL8      XL8      XL8      48-95,144-191
GPU 5/0  SYS      SYS      SYS      SYS      XL8      S        XL8      XL8      48-95,144-191
GPU 6/0  SYS      SYS      SYS      SYS      XL8      XL8      S        XL8      48-95,144-191
GPU 7/0  SYS      SYS      SYS      SYS      XL8      XL8      XL8      S        48-95,144-191
ZE_AFFINITY_MASK=0,1,2,3