(base) ubuntu@g3-xlarge-x86-dal-1:~/steven/DHS-LLM-Workshop/chat_assistant/training$ ./run_fsdp.sh
config.json: 100%|██████████| 588/588 [00:00<00:00, 4.42MB/s]
model.safetensors.index.json: 100%|██████████| 37.6k/37.6k [00:00<00:00, 105MB/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
/home/ubuntu/anaconda3/lib/python3.11/site-packages/datasets/load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Size of the train set: 10876. Size of the validation set: 818
  0%|          | 0/400 [00:00<?, ?it/s]
100%|██████████| 400/400 [00:00<00:00, 595.17it/s]
The character to token ratio of the dataset is: 3.89
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 8192)
    (layers): ModuleList(
      (0-47): 48 x LlamaDecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=8192, out_features=22016, bias=False)
          (up_proj): Linear(in_features=8192, out_features=22016, bias=False)
          (down_proj): Linear(in_features=22016, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=8192, out_features=32000, bias=False)
)
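For reference, a rough parameter count for the model printed above, computed only from the layer shapes in the module repr. This is a back-of-envelope sketch in plain Python; the bytes-per-parameter range assumes full fine-tuning with Adam under FSDP full sharding, which is presumed (not confirmed) to be what run_fsdp.sh configures.

vocab, hidden, inter, kv_dim, n_layers = 32000, 8192, 22016, 1024, 48

embed     = vocab * hidden                             # embed_tokens
attn      = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj/o_proj + k_proj/v_proj
mlp       = 3 * hidden * inter                         # gate_proj/up_proj/down_proj
norms     = 2 * hidden                                 # two RMSNorms per decoder layer
per_layer = attn + mlp + norms

total = embed + n_layers * per_layer + hidden + vocab * hidden  # + final norm + lm_head
print(f"{total / 1e9:.1f}B parameters")                # ~33.7B

# At roughly 12 to 18 bytes per parameter for weights, gradients and Adam
# states, that is on the order of 400 to 600 GB of training state, i.e. about
# 50 to 75 GB per GPU when fully sharded across the 8 GPUs, before activations
# are counted. This is a rough estimate, not an exact accounting, but it is in
# line with the ~74.6 GiB "allocated by PyTorch" reported in the OOM messages
# further down.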
wandb: Currently logged in as: hughzhang (econcs-harvard). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/wandb/run-20231123_082057-99i6nxxw
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run sparkling-shadow-108
wandb: ⭐️ View project at https://wandb.ai/econcs-harvard/huggingface
wandb: 🚀 View run at https://wandb.ai/econcs-harvard/huggingface/runs/99i6nxxw
  0%|          | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 252, in <module>
    main(args)
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 223, in main
    trainer.train()
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in training_step
    self.accelerator.backward(loss)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1986, in backward
    loss.backward(**kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 1 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The remaining seven ranks fail with the identical traceback through trainer.train() -> accelerator.backward(loss) -> torch.utils.checkpoint; only the final OutOfMemoryError differs:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 5 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 6 has a total capacity of 79.19 GiB of which 944.75 MiB is free. Including non-PyTorch memory, this process has 78.26 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 3 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 2 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 7 has a total capacity of 79.19 GiB of which 984.75 MiB is free. Including non-PyTorch memory, this process has 78.22 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 4 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 0 has a total capacity of 79.19 GiB of which 1.27 GiB is free. Including non-PyTorch memory, this process has 77.92 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
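On every rank only about 1.2 to 1.6 GiB is reserved but unallocated, so the max_split_size_mb hint at the end of each message can recover at most that much. For completeness, the allocator option is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation; a minimal sketch (the 128 MB value is an arbitrary example, and exporting the variable in run_fsdp.sh or the shell works just as well):

import os

# Must take effect before PyTorch's CUDA caching allocator makes its first
# allocation, e.g. at the very top of train.py, before any CUDA work.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # importing torch only after setting the variable keeps this on the safe side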
[2023-11-23 08:21:22,925] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 120387 closing signal SIGTERM
[2023-11-23 08:21:24,492] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 120388) of binary: /home/ubuntu/anaconda3/bin/python
wandb: 🚀 View run sparkling-shadow-108 at: https://wandb.ai/econcs-harvard/huggingface/runs/99i6nxxw
wandb: ⚡ View job at https://wandb.ai/econcs-harvard/huggingface/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjExODA2MzUyMA==/version_details/v0
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231123_082057-99i6nxxw/logs
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 981, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 120389)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 120390)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 120391)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 120392)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 120393)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 120394)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 120388)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
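The error_file: <N/A> entries and the "To enable traceback" pointers above mean the worker processes never wrote an error record, so the launcher can only report exit codes. Following the linked elastic-errors docs, one way to get each rank's real exception into that Failures table is to wrap the entrypoint of train.py with the @record decorator. A sketch, with main(args) assumed from the tracebacks above rather than copied from the actual file:

from torch.distributed.elastic.multiprocessing.errors import record

@record  # on an uncaught exception, records its type, message and traceback to the error file the elastic agent reads
def main(args):
    ...  # existing training code in train.py stays unchanged

With that in place, the table above would show each rank's OutOfMemoryError instead of the generic pointer.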