Closed
Labels
bug, module: distributed
Description
🐛 Describe the bug
Please get the wheels from https://github.com/intel/torch-xpu-ops/actions/runs/18029979174, or download them with gh:
gh run download 18029979174 --repo intel/torch-xpu-ops --name Torch-XPU-Wheel-1826 --dir path --pattern "*.zip"
git clone -b distributed_2.9 https://github.com/daisyden/pytorch.git
cd pytorch
pip install -r requirements.txt
cd test/distributed/elastic/rendezvous/
python -m unittest dynamic_rendezvous_test.BackendRendezvousStateHolderTest.test_sync_gets_backend_state_if_cached_state_has_expired
python -m unittest dynamic_rendezvous_test.BackendRendezvousStateHolderTest.test_sync_gets_backend_state_if_local_state_is_clean
python -m unittest dynamic_rendezvous_test.BackendRendezvousStateHolderTest.test_sync_gets_backend_state_if_local_state_is_old_and_dirty
python -m unittest dynamic_rendezvous_test.BackendRendezvousStateHolderTest.test_sync_sanitizes_state
python -m unittest dynamic_rendezvous_test.BackendRendezvousStateHolderTest.test_sync_sanitizes_state_if_no_participants_is_left
python -m unittest dynamic_rendezvous_test.BackendRendezvousStateHolderTest.test_sync_uses_cached_state_if_cache_duration_is_specified
python -m unittest dynamic_rendezvous_test.BackendRendezvousStateHolderTest.test_keep_alive_updates_last_heartbeat
python -m unittest dynamic_rendezvous_test.TestRendezvousJoinOp.test_keep_alive_for_redundant_node
python -m unittest dynamic_rendezvous_test.TestRendezvousJoinOp.test_marks_rendezvous_complete_if_node_is_participant_and_last_call_deadline_exceeded
python -m unittest dynamic_rendezvous_test.TestRendezvousJoinOp.test_waits_next_round_if_rendezvous_is_complete_and_node_is_in_wait_list
python -m unittest dynamic_rendezvous_test.TestRendezvousJoinOp.test_waits_next_round_if_rendezvous_is_complete_and_node_is_redundant
python -m unittest dynamic_rendezvous_test.TestRendezvousJoinOp.test_waits_rendezvous_to_complete_if_node_is_participant
python -m unittest dynamic_rendezvous_test.TestRendezvousKeepAliveOp.test_finishes_if_no_keep_alive_update_is_needed
python -m unittest dynamic_rendezvous_test.TestRendezvousKeepAliveOp.test_raises_timeout_if_deadlined_exceeded
python -m unittest dynamic_rendezvous_test.TestRendezvousKeepAliveOp.test_updates_keep_alive_if_needed
ERROR: test_sync_gets_backend_state_if_cached_state_has_expired (dynamic_rendezvous_test.BackendRendezvousStateHolderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sdp/xiangdong/pytorch/test/distributed/elastic/rendezvous/dynamic_rendezvous_test.py", line 427, in test_sync_gets_backend_state_if_cached_state_has_expired
    state_holder.sync()
  File "/home/sdp/miniforge-pypy3/envs/xccl_ww27/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 467, in sync
    self._sanitize()
  File "/home/sdp/miniforge-pypy3/envs/xccl_ww27/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 479, in _sanitize
    self._dead_nodes = [
  File "/home/sdp/miniforge-pypy3/envs/xccl_ww27/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 482, in <listcomp>
    if last_heartbeat < expire_time
TypeError: '<' not supported between instances of 'datetime.datetime' and 'MagicMock'
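The same kind of TypeError can be reproduced outside PyTorch. The sketch below is a guess at the mechanism, not a confirmed root cause. It assumes the test patches the module's `datetime` and configures only one of its methods, while the installed `dynamic_rendezvous.py` calls a different one. The names `mock_datetime`, `utcnow`, and `now` as used here are illustrative assumptions and are not taken from the test or the library source.

```python
from datetime import datetime, timedelta, timezone
from unittest.mock import MagicMock


def reproduce() -> str:
    """Return the TypeError message raised by comparing a datetime to a mock."""
    last_heartbeat = datetime(2000, 1, 1, tzinfo=timezone.utc)

    # Hypothetical test setup: only utcnow() is given a real return value.
    mock_datetime = MagicMock()
    mock_datetime.utcnow.return_value = last_heartbeat

    # Hypothetical code under test: it calls now() instead, which was never
    # configured, so the result (and anything derived from it) is a MagicMock.
    expire_time = mock_datetime.now(timezone.utc) - timedelta(seconds=30)

    try:
        last_heartbeat < expire_time
    except TypeError as err:
        return str(err)
    return "no error"


print(reproduce())
```

If this is what happens here, the test file from the distributed_2.9 branch and the `dynamic_rendezvous.py` shipped in the wheel likely disagree on which `datetime` call is used, so checking that both come from the same commit would be a reasonable first step.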
Versions
pytorch: https://github.com/daisyden/pytorch/tree/distributed_2.9