In this case, the chaos is injected by the user's training code itself rather than by chaos-blade; a sketch of the error-injection helper is given after the log below.

[2024-04-24 06:38:53,888] [INFO] [elastic_run.py:268:_check_dlrover_master_available] DLRover job master starts!
INFO:root:Using nproc_per_node=2.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2024-04-24 06:38:53,889] [INFO] [training.py:688:launch_agent] Starting elastic_operator with launch configs:
  entrypoint : examples/pytorch/mnist/cnn_train.py
  min_nodes : 3
  max_nodes : 4
  nproc_per_node : 2
  run_id : chaos-test
  rdzv_backend : static
  rdzv_endpoint : 127.0.0.1:29500
  rdzv_configs : {'pend_timeout': '600', 'rank': 0, 'timeout': 900, 'node_unit': 1}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}
[2024-04-24 06:38:53,889] [INFO] [master_client.py:422:build_master_client] Build master client with addr elasticjob-chaos-test-dlrover-master:50001.
[2024-04-24 06:38:53,897] [INFO] [master_client.py:75:__init__] dlrover master addr is elasticjob-chaos-test-dlrover-master:50001
[2024-04-24 06:38:53,898] [WARNING] [resource.py:136:init_gpu_monitor] NVIDIA NVML library not found. GPU monitoring features will be disabled.
[2024-04-24 06:38:53,898] [INFO] [resource.py:102:start] Resource Monitor Initializing ...
[2024-04-24 06:38:53,898] [INFO] [resource.py:177:_monitor_resource] Start to monitor resource usage
[2024-04-24 06:38:53,898] [INFO] [resource.py:112:start] Resource Monitor initialized successfully
[2024-04-24 06:38:53,907] [INFO] [ckpt_saver.py:430:_factory] Start the checkpoint saver factory.
[2024-04-24 06:38:53,907] [INFO] [training.py:552:_invoke_run] [dlrover-trainer] starting workers for entrypoint: python
[2024-04-24 06:38:53,907] [INFO] [training.py:953:network_check] Starting elastic_operator with launch configs:
  entrypoint : dlrover.trainer.torch.node_check.nvidia_gpu
  min_nodes : 3
  max_nodes : 4
  nproc_per_node : 2
  run_id : chaos-test
  rdzv_backend : static
  rdzv_endpoint : 127.0.0.1:29500
  rdzv_configs : {'pend_timeout': '600', 'rank': 0, 'timeout': 900, 'node_unit': 1}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}
[2024-04-24 06:38:53,908] [INFO] [training.py:827:run] [dlrover-trainer] starting workers for entrypoint: python
[2024-04-24 06:38:53,908] [INFO] [training.py:251:next_rendezvous] The node chaos-test-edljob-worker-0 with rank 0 attempts to join the next round of the rendezvous network-check with timeout 600.
[2024-04-24 06:38:56,914] [INFO] [training.py:286:next_rendezvous] The node chaos-test-edljob-worker-0 has joined round 1 of the network-check rendezvous as rank 0 in a world of size 2.
[2024-04-24 06:38:56,918] [INFO] [training.py:423:_rendezvous] [dlrover-trainer] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.0.6 master_port=26563 group_rank=0 group_world_size=2 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[4, 4] global_world_sizes=[4, 4]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2024-04-24 06:39:01,056] [INFO] [utils.py:43:wrapper] Time to execute on local rank 1 is 0.01.
[2024-04-24 06:39:01,060] [INFO] [utils.py:43:wrapper] Time to execute on local rank 0 is 0.02.
[2024-04-24 06:39:08,900] [INFO] [utils.py:43:wrapper] Time to execute on local rank 1 is 7.84.
[2024-04-24 06:39:08,905] [INFO] [utils.py:43:wrapper] Time to execute on local rank 0 is 7.84.
[2024-04-24 06:39:08,921] [INFO] [nvidia_gpu.py:44:] Finish checking node in 8.75s.
[2024-04-24 06:39:08,928] [INFO] [nvidia_gpu.py:44:] Finish checking node in 8.92s.
[2024-04-24 06:39:11,982] [INFO] [training.py:837:run] Network check time of round 0 is 8.92 and succeed is True.
[2024-04-24 06:39:11,985] [INFO] [training.py:850:run] Fault nodes are: [] and stragglers are: [].
[2024-04-24 06:39:11,985] [INFO] [training.py:1006:network_check] Network check result is True
[2024-04-24 06:39:11,986] [INFO] [training.py:1026:run_network_check] Node check passed.
[2024-04-24 06:39:11,986] [INFO] [training.py:251:next_rendezvous] The node chaos-test-edljob-worker-0 with rank 0 attempts to join the next round of the rendezvous elastic-training with timeout 600.
[2024-04-24 06:39:17,997] [INFO] [training.py:286:next_rendezvous] The node chaos-test-edljob-worker-0 has joined round 1 of the elastic-training rendezvous as rank 0 in a world of size 4.
[2024-04-24 06:39:18,000] [INFO] [training.py:423:_rendezvous] [dlrover-trainer] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.0.6 master_port=23168 group_rank=0 group_world_size=4 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[8, 8] global_world_sizes=[8, 8]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
rank 0 is initialized local_rank = 0
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
rank 1 is initialized local_rank = 1
[2024-04-24 06:39:24,810] [INFO] [full_ckpt_engine.py:79:get_saving_ranks] The ranks to save checkpoint are [0, 2, 4, 6].
[2024-04-24 06:39:24,813] [INFO] [full_ckpt_engine.py:79:get_saving_ranks] The ranks to save checkpoint are [0, 2, 4, 6].
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:2 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:2 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
[2024-04-24 06:39:24,907] [INFO] [engine.py:43:_local_rank0_log] Create a gloo commumication group to save checkpoint. Saving ranks are [0, 2, 4, 6].
[2024-04-24 06:39:24,915] [INFO] [ckpt_saver.py:395:__init__] Initialize the AsyncSaver with arguments: checkpoint_dir=/nas/mnist-ckpt/, local_shard_num=1, global_shard_num=1,
[2024-04-24 06:39:24,915] [INFO] [ckpt_saver.py:511:_sync_shm_to_storage] Async flash checkpoint saver starts!
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
[2024-04-24 06:39:25,909] [INFO] [ckpt_saver.py:515:_sync_shm_to_storage] Reset the shared memory after the training starts. The number of global shards is 1.
[2024-04-24 06:39:25,921] [INFO] [trainer.py:322:_set_gradient_accumulation_steps] Rank = 1, World size = 8, Gradient accumulation steps = 1
[2024-04-24 06:39:25,921] [INFO] [trainer.py:322:_set_gradient_accumulation_steps] Rank = 0, World size = 8, Gradient accumulation steps = 1
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
Active exception trigger.
Traceback (most recent call last):
  File "examples/pytorch/mnist/cnn_train.py", line 338, in <module>
    train(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "examples/pytorch/mnist/cnn_train.py", line 184, in train
    step = train_epoch(
  File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch
    generate_error()
  File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error
    raise error_list[random.randint(0, len(error_list) - 1)]
IndexError: E2
loss = 2.1627330780029297, step = 20
Active exception trigger.
Traceback (most recent call last):
  File "examples/pytorch/mnist/cnn_train.py", line 338, in <module>
    train(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "examples/pytorch/mnist/cnn_train.py", line 184, in train
    step = train_epoch(
  File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch
    generate_error()
  File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error
    raise error_list[random.randint(0, len(error_list) - 1)]
TypeError: E3
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 48) of binary: /usr/local/bin/python
[2024-04-24 06:39:33,037] [ERROR] [training.py:587:_invoke_run] The worker fails with {0: ProcessFailure(local_rank=0, pid=48, exitcode=1, error_file='/tmp/torchelastic_aqfiol1t/attempt_0/0/error.json', error_file_data={'message': {'message': 'TypeError: E3', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nTypeError: E3\n', 'timestamp': '1713940769'}}}, message={'message': 'TypeError: E3', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nTypeError: E3\n', 'timestamp': '1713940769'}}, timestamp=1713940769), 1: ProcessFailure(local_rank=1, pid=49, exitcode=1, error_file='/tmp/torchelastic_aqfiol1t/attempt_0/1/error.json', error_file_data={'message': {'message': 'IndexError: E2', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nIndexError: E2\n', 'timestamp': '1713940769'}}}, message={'message': 'IndexError: E2', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nIndexError: E2\n', 'timestamp': '1713940769'}}, timestamp=1713940769)}
[2024-04-24 06:39:33,038] [INFO] [training.py:591:_invoke_run] [dlrover-trainer] Worker group FAILED. 3/3 attempts left; will restart worker group
[2024-04-24 06:39:33,038] [INFO] [ckpt_saver.py:711:reset] Reset all shared memory of shards.
[2024-04-24 06:39:33,039] [INFO] [training.py:953:network_check] Starting elastic_operator with launch configs:
  entrypoint : dlrover.trainer.torch.node_check.nvidia_gpu
  min_nodes : 3
  max_nodes : 4
  nproc_per_node : 2
  run_id : chaos-test
  rdzv_backend : static
  rdzv_endpoint : 127.0.0.1:29500
  rdzv_configs : {'pend_timeout': '600', 'rank': 0, 'timeout': 900, 'node_unit': 1}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}
[2024-04-24 06:39:33,042] [INFO] [training.py:827:run] [dlrover-trainer] starting workers for entrypoint: python
[2024-04-24 06:39:33,042] [INFO] [training.py:251:next_rendezvous] The node chaos-test-edljob-worker-0 with rank 0 attempts to join the next round of the rendezvous network-check with timeout 600.
[2024-04-24 06:39:36,048] [INFO] [training.py:286:next_rendezvous] The node chaos-test-edljob-worker-0 has joined round 3 of the network-check rendezvous as rank 0 in a world of size 2.
[2024-04-24 06:39:36,051] [INFO] [training.py:423:_rendezvous] [dlrover-trainer] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.0.6 master_port=22506 group_rank=0 group_world_size=2 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[4, 4] global_world_sizes=[4, 4]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2024-04-24 06:39:39,732] [INFO] [utils.py:43:wrapper] Time to execute on local rank 0 is 0.01.
[2024-04-24 06:39:39,733] [INFO] [utils.py:43:wrapper] Time to execute on local rank 1 is 0.01.
[2024-04-24 06:39:50,684] [INFO] [utils.py:43:wrapper] Time to execute on local rank 0 is 10.95.
[2024-04-24 06:39:50,686] [INFO] [utils.py:43:wrapper] Time to execute on local rank 1 is 10.95.
[2024-04-24 06:39:50,705] [INFO] [nvidia_gpu.py:44:] Finish checking node in 12.03s.
[2024-04-24 06:39:50,705] [INFO] [nvidia_gpu.py:44:] Finish checking node in 12.09s.
[2024-04-24 06:39:54,087] [INFO] [training.py:837:run] Network check time of round 0 is 12.09 and succeed is True.
[2024-04-24 06:39:54,090] [INFO] [training.py:850:run] Fault nodes are: [] and stragglers are: [].
[2024-04-24 06:39:54,090] [INFO] [training.py:1006:network_check] Network check result is True
[2024-04-24 06:39:54,090] [INFO] [training.py:1026:run_network_check] Node check passed.
[2024-04-24 06:39:54,090] [INFO] [training.py:251:next_rendezvous] The node chaos-test-edljob-worker-0 with rank 0 attempts to join the next round of the rendezvous elastic-training with timeout 600.
[2024-04-24 06:40:00,102] [INFO] [training.py:286:next_rendezvous] The node chaos-test-edljob-worker-0 has joined round 2 of the elastic-training rendezvous as rank 0 in a world of size 4.
[2024-04-24 06:40:00,105] [INFO] [training.py:423:_rendezvous] [dlrover-trainer] Rendezvous complete for workers. Result: restart_count=1 master_addr=192.168.0.6 master_port=25592 group_rank=0 group_world_size=4 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[8, 8] global_world_sizes=[8, 8]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
rank 1 is initialized local_rank = 1
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
rank 0 is initialized local_rank = 0
[2024-04-24 06:40:07,706] [INFO] [full_ckpt_engine.py:79:get_saving_ranks] The ranks to save checkpoint are [0, 2, 4, 6].
[2024-04-24 06:40:07,708] [INFO] [full_ckpt_engine.py:79:get_saving_ranks] The ranks to save checkpoint are [0, 2, 4, 6].
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:2 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:2 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
[2024-04-24 06:40:07,791] [INFO] [engine.py:43:_local_rank0_log] Create a gloo commumication group to save checkpoint. Saving ranks are [0, 2, 4, 6].
[2024-04-24 06:40:07,792] [INFO] [ckpt_saver.py:445:_factory] The saver is already created, skip creating the saver.
[2024-04-24 06:40:07,792] [INFO] [ckpt_saver.py:515:_sync_shm_to_storage] Reset the shared memory after the training starts. The number of global shards is 1.
[2024-04-24 06:40:07,801] [INFO] [trainer.py:322:_set_gradient_accumulation_steps] Rank = 1, World size = 8, Gradient accumulation steps = 1
[2024-04-24 06:40:07,801] [INFO] [trainer.py:322:_set_gradient_accumulation_steps] Rank = 0, World size = 8, Gradient accumulation steps = 1
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
Active exception trigger.
loss = 1.8460348844528198, step = 20
Active exception trigger.
Traceback (most recent call last):
  File "examples/pytorch/mnist/cnn_train.py", line 338, in <module>
    train(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "examples/pytorch/mnist/cnn_train.py", line 184, in train
    step = train_epoch(
  File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch
    generate_error()
  File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error
    raise error_list[random.randint(0, len(error_list) - 1)]
ValueError: E0
Traceback (most recent call last):
  File "examples/pytorch/mnist/cnn_train.py", line 338, in <module>
    train(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "examples/pytorch/mnist/cnn_train.py", line 184, in train
    step = train_epoch(
  File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch
    generate_error()
  File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error
    raise error_list[random.randint(0, len(error_list) - 1)]
TypeError: E3
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 89) of binary: /usr/local/bin/python
[2024-04-24 06:40:15,143] [ERROR] [training.py:587:_invoke_run] The worker fails with {0: ProcessFailure(local_rank=0, pid=89, exitcode=1, error_file='/tmp/torchelastic_aqfiol1t/attempt_1/0/error.json', error_file_data={'message': {'message': 'TypeError: E3', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nTypeError: E3\n', 'timestamp': '1713940813'}}}, message={'message': 'TypeError: E3', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nTypeError: E3\n', 'timestamp': '1713940813'}}, timestamp=1713940813), 1: ProcessFailure(local_rank=1, pid=90, exitcode=1, error_file='/tmp/torchelastic_aqfiol1t/attempt_1/1/error.json', error_file_data={'message': {'message': 'ValueError: E0', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nValueError: E0\n', 'timestamp': '1713940813'}}}, message={'message': 'ValueError: E0', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nValueError: E0\n', 'timestamp': '1713940813'}}, timestamp=1713940813)}
[2024-04-24 06:40:15,145] [INFO] [training.py:591:_invoke_run] [dlrover-trainer] Worker group FAILED. 2/3 attempts left; will restart worker group
[2024-04-24 06:40:15,145] [INFO] [ckpt_saver.py:711:reset] Reset all shared memory of shards.
[2024-04-24 06:40:15,145] [INFO] [training.py:953:network_check] Starting elastic_operator with launch configs:
  entrypoint : dlrover.trainer.torch.node_check.nvidia_gpu
  min_nodes : 3
  max_nodes : 4
  nproc_per_node : 2
  run_id : chaos-test
  rdzv_backend : static
  rdzv_endpoint : 127.0.0.1:29500
  rdzv_configs : {'pend_timeout': '600', 'rank': 0, 'timeout': 900, 'node_unit': 1}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}
[2024-04-24 06:40:15,146] [INFO] [training.py:827:run] [dlrover-trainer] starting workers for entrypoint: python
[2024-04-24 06:40:15,146] [INFO] [training.py:251:next_rendezvous] The node chaos-test-edljob-worker-0 with rank 0 attempts to join the next round of the rendezvous network-check with timeout 600.
[2024-04-24 06:40:18,152] [INFO] [training.py:286:next_rendezvous] The node chaos-test-edljob-worker-0 has joined round 5 of the network-check rendezvous as rank 0 in a world of size 2.
[2024-04-24 06:40:18,155] [INFO] [training.py:423:_rendezvous] [dlrover-trainer] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.0.6 master_port=28569 group_rank=0 group_world_size=2 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[4, 4] global_world_sizes=[4, 4]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2024-04-24 06:40:23,435] [INFO] [utils.py:43:wrapper] Time to execute on local rank 0 is 0.01.
[2024-04-24 06:40:23,473] [INFO] [utils.py:43:wrapper] Time to execute on local rank 1 is 0.05.
[2024-04-24 06:40:33,898] [INFO] [utils.py:43:wrapper] Time to execute on local rank 1 is 10.42.
[2024-04-24 06:40:33,902] [INFO] [utils.py:43:wrapper] Time to execute on local rank 0 is 10.47.
[2024-04-24 06:40:33,916] [INFO] [nvidia_gpu.py:44:] Finish checking node in 11.04s.
[2024-04-24 06:40:33,970] [INFO] [nvidia_gpu.py:44:] Finish checking node in 10.68s.
[2024-04-24 06:40:36,198] [INFO] [training.py:837:run] Network check time of round 0 is 11.04 and succeed is True.
[2024-04-24 06:40:41,206] [INFO] [training.py:850:run] Fault nodes are: [] and stragglers are: [].
[2024-04-24 06:40:41,206] [INFO] [training.py:1006:network_check] Network check result is True
[2024-04-24 06:40:41,207] [INFO] [training.py:1026:run_network_check] Node check passed.
[2024-04-24 06:40:41,207] [INFO] [training.py:251:next_rendezvous] The node chaos-test-edljob-worker-0 with rank 0 attempts to join the next round of the rendezvous elastic-training with timeout 600.
[2024-04-24 06:40:41,209] [INFO] [training.py:286:next_rendezvous] The node chaos-test-edljob-worker-0 has joined round 3 of the elastic-training rendezvous as rank 0 in a world of size 4.
[2024-04-24 06:40:41,212] [INFO] [training.py:423:_rendezvous] [dlrover-trainer] Rendezvous complete for workers. Result: restart_count=2 master_addr=192.168.0.6 master_port=22600 group_rank=0 group_world_size=4 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[8, 8] global_world_sizes=[8, 8]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
rank 0 is initialized local_rank = 0
rank 1 is initialized local_rank = 1
[2024-04-24 06:40:49,280] [INFO] [full_ckpt_engine.py:79:get_saving_ranks] The ranks to save checkpoint are [0, 2, 4, 6].
[2024-04-24 06:40:49,285] [INFO] [full_ckpt_engine.py:79:get_saving_ranks] The ranks to save checkpoint are [0, 2, 4, 6].
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:2 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:2 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
[2024-04-24 06:40:49,338] [INFO] [engine.py:43:_local_rank0_log] Create a gloo commumication group to save checkpoint. Saving ranks are [0, 2, 4, 6].
[2024-04-24 06:40:49,339] [INFO] [ckpt_saver.py:445:_factory] The saver is already created, skip creating the saver.
[2024-04-24 06:40:49,339] [INFO] [ckpt_saver.py:515:_sync_shm_to_storage] Reset the shared memory after the training starts. The number of global shards is 1.
[2024-04-24 06:40:49,371] [INFO] [trainer.py:322:_set_gradient_accumulation_steps] Rank = 0, World size = 8, Gradient accumulation steps = 1
[2024-04-24 06:40:49,371] [INFO] [trainer.py:322:_set_gradient_accumulation_steps] Rank = 1, World size = 8, Gradient accumulation steps = 1
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
loss = 2.1073391437530518, step = 20
Active exception trigger.
Active exception trigger.
Traceback (most recent call last):
  File "examples/pytorch/mnist/cnn_train.py", line 338, in <module>
    train(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "examples/pytorch/mnist/cnn_train.py", line 184, in train
    step = train_epoch(
  File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch
    generate_error()
  File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error
    raise error_list[random.randint(0, len(error_list) - 1)]
TypeError: E3
Traceback (most recent call last):
  File "examples/pytorch/mnist/cnn_train.py", line 338, in <module>
    train(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "examples/pytorch/mnist/cnn_train.py", line 184, in train
    step = train_epoch(
  File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch
    generate_error()
  File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error
    raise error_list[random.randint(0, len(error_list) - 1)]
IndexError: E4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 118) of binary: /usr/local/bin/python
[2024-04-24 06:40:56,281] [ERROR] [training.py:587:_invoke_run] The worker fails with {0: ProcessFailure(local_rank=0, pid=118, exitcode=1, error_file='/tmp/torchelastic_aqfiol1t/attempt_2/0/error.json', error_file_data={'message': {'message': 'IndexError: E4', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nIndexError: E4\n', 'timestamp': '1713940853'}}}, message={'message': 'IndexError: E4', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nIndexError: E4\n', 'timestamp': '1713940853'}}, timestamp=1713940853), 1: ProcessFailure(local_rank=1, pid=119, exitcode=1, error_file='/tmp/torchelastic_aqfiol1t/attempt_2/1/error.json', error_file_data={'message': {'message': 'TypeError: E3', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nTypeError: E3\n', 'timestamp': '1713940853'}}}, message={'message': 'TypeError: E3', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper\n return f(*args, **kwargs)\n File "examples/pytorch/mnist/cnn_train.py", line 184, in train\n step = train_epoch(\n File "examples/pytorch/mnist/cnn_train.py", line 240, in train_epoch\n generate_error()\n File "examples/pytorch/mnist/cnn_train.py", line 96, in generate_error\n raise error_list[random.randint(0, len(error_list) - 1)]\nTypeError: E3\n', 'timestamp': '1713940853'}}, timestamp=1713940853)}
[2024-04-24 06:40:56,283] [INFO] [training.py:591:_invoke_run] [dlrover-trainer] Worker group FAILED. 1/3 attempts left; will restart worker group
[2024-04-24 06:40:56,283] [INFO] [ckpt_saver.py:711:reset] Reset all shared memory of shards.
[2024-04-24 06:40:56,283] [INFO] [training.py:953:network_check] Starting elastic_operator with launch configs:
  entrypoint : dlrover.trainer.torch.node_check.nvidia_gpu
  min_nodes : 3
  max_nodes : 4
  nproc_per_node : 2
  run_id : chaos-test
  rdzv_backend : static
  rdzv_endpoint : 127.0.0.1:29500
  rdzv_configs : {'pend_timeout': '600', 'rank': 0, 'timeout': 900, 'node_unit': 1}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}
[2024-04-24 06:40:56,285] [INFO] [training.py:827:run] [dlrover-trainer] starting workers for entrypoint: python
[2024-04-24 06:40:56,285] [INFO] [training.py:251:next_rendezvous] The node chaos-test-edljob-worker-0 with rank 0 attempts to join the next round of the rendezvous network-check with timeout 600.
[2024-04-24 06:40:59,290] [INFO] [training.py:286:next_rendezvous] The node chaos-test-edljob-worker-0 has joined round 7 of the network-check rendezvous as rank 0 in a world of size 2.
[2024-04-24 06:40:59,293] [INFO] [training.py:423:_rendezvous] [dlrover-trainer] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.0.6 master_port=20909 group_rank=0 group_world_size=2 local_ranks=[0, 1] role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[4, 4] global_world_sizes=[4, 4]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2024-04-24 06:41:03,241] [INFO] [utils.py:43:wrapper] Time to execute on local rank 0 is 0.01.
[2024-04-24 06:41:03,242] [INFO] [utils.py:43:wrapper] Time to execute on local rank 1 is 0.01.
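
For reference, below is a minimal sketch of the error-injection helper that produces the failures above, reconstructed only from the tracebacks in this log. The function and variable names match the frames shown; the concrete exception list is an assumption based on the labels E0, E2, E3, and E4 observed in this run, and the real cnn_train.py may define more entries and a different trigger condition.

    import random

    def generate_error():
        # Assumed reconstruction: only the exceptions observed in this log run
        # (E0, E2, E3, E4) are listed; the actual error_list in
        # examples/pytorch/mnist/cnn_train.py may contain other entries.
        error_list = [
            ValueError("E0"),
            IndexError("E2"),
            TypeError("E3"),
            IndexError("E4"),
        ]
        # The raising line below is taken verbatim from the tracebacks above
        # (cnn_train.py, line 96).
        raise error_list[random.randint(0, len(error_list) - 1)]

    if __name__ == "__main__":
        # Mirrors the marker printed right before each injected failure in the log.
        print("Active exception trigger.")
        generate_error()

Because the exception is drawn at random, each restart attempt in the log fails with a different error (IndexError: E2 and TypeError: E3, then ValueError: E0 and TypeError: E3, then TypeError: E3 and IndexError: E4), and after every failure the agent resets the shared checkpoint memory, re-runs the node network check, re-rendezvous, and restarts the worker group, counting down the remaining attempts (3/3, 2/3, 1/3 in this log).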