(base) ubuntu@g3-xlarge-x86-dal-1:~/steven/DHS-LLM-Workshop/chat_assistant/training$ ./run_fsdp.sh
config.json: 100%|██████████| 588/588 [00:00<00:00, 4.42MB/s]
model.safetensors.index.json: 100%|██████████| 37.6k/37.6k [00:00<00:00, 105MB/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
/home/ubuntu/anaconda3/lib/python3.11/site-packages/datasets/load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Size of the train set: 10876. Size of the validation set: 818
  0%|          | 0/400 [00:00<?, ?it/s]
100%|██████████| 400/400 [00:00<00:00, 595.17it/s]
The character to token ratio of the dataset is: 3.89
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 8192)
    (layers): ModuleList(
      (0-47): 48 x LlamaDecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=8192, out_features=22016, bias=False)
          (up_proj): Linear(in_features=8192, out_features=22016, bias=False)
          (down_proj): Linear(in_features=22016, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=8192, out_features=32000, bias=False)
)
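For reference, a rough parameter count for the model printed above, computed only from the layer shapes in the module repr. This is a back-of-envelope sketch in plain Python; the bytes-per-parameter range assumes full fine-tuning with Adam under FSDP full sharding, which is presumed (not confirmed) to be what run_fsdp.sh configures.

vocab, hidden, inter, kv_dim, n_layers = 32000, 8192, 22016, 1024, 48

embed     = vocab * hidden                             # embed_tokens
attn      = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj/o_proj + k_proj/v_proj
mlp       = 3 * hidden * inter                         # gate_proj/up_proj/down_proj
norms     = 2 * hidden                                 # two RMSNorms per decoder layer
per_layer = attn + mlp + norms

total = embed + n_layers * per_layer + hidden + vocab * hidden  # + final norm + lm_head
print(f"{total / 1e9:.1f}B parameters")                # ~33.7B

# At roughly 12 to 18 bytes per parameter for weights, gradients and Adam
# states, that is on the order of 400 to 600 GB of training state, i.e. about
# 50 to 75 GB per GPU when fully sharded across the 8 GPUs, before activations
# are counted. This is a rough estimate, not an exact accounting, but it is in
# line with the ~74.6 GiB "allocated by PyTorch" reported in the OOM messages
# further down.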
wandb: Currently logged in as: hughzhang (econcs-harvard). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/wandb/run-20231123_082057-99i6nxxw
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run sparkling-shadow-108
wandb: ⭐️ View project at https://wandb.ai/econcs-harvard/huggingface
wandb: 🚀 View run at https://wandb.ai/econcs-harvard/huggingface/runs/99i6nxxw
  0%|          | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 252, in <module>
    main(args)
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 223, in main
    trainer.train()
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in training_step
    self.accelerator.backward(loss)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1986, in backward
    loss.backward(**kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 1 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The remaining seven ranks fail with the identical traceback through trainer.train() -> accelerator.backward(loss) -> torch.utils.checkpoint; only the final OutOfMemoryError differs:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 5 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 6 has a total capacity of 79.19 GiB of which 944.75 MiB is free. Including non-PyTorch memory, this process has 78.26 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 3 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 2 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 7 has a total capacity of 79.19 GiB of which 984.75 MiB is free. Including non-PyTorch memory, this process has 78.22 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 4 has a total capacity of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 0 has a total capacity of 79.19 GiB of which 1.27 GiB is free. Including non-PyTorch memory, this process has 77.92 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
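On every rank only about 1.2 to 1.6 GiB is reserved but unallocated, so the max_split_size_mb hint at the end of each message can recover at most that much. For completeness, the allocator option is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation; a minimal sketch (the 128 MB value is an arbitrary example, and exporting the variable in run_fsdp.sh or the shell works just as well):

import os

# Must take effect before PyTorch's CUDA caching allocator makes its first
# allocation, e.g. at the very top of train.py, before any CUDA work.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # importing torch only after setting the variable keeps this on the safe side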
[2023-11-23 08:21:22,925] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 120387 closing signal SIGTERM
[2023-11-23 08:21:24,492] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 120388) of binary: /home/ubuntu/anaconda3/bin/python
wandb: 🚀 View run sparkling-shadow-108 at: https://wandb.ai/econcs-harvard/huggingface/runs/99i6nxxw
wandb: ⚡ View job at https://wandb.ai/econcs-harvard/huggingface/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjExODA2MzUyMA==/version_details/v0
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231123_082057-99i6nxxw/logs
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 981, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 120389)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 120390)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 120391)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 120392)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 120393)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 120394)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-23_08:21:22
  host      : g3-xlarge-x86-dal-1
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 120388)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
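The error_file: <N/A> entries and the "To enable traceback" pointers above mean the worker processes never wrote an error record, so the launcher can only report exit codes. Following the linked elastic-errors docs, one way to get each rank's real exception into that Failures table is to wrap the entrypoint of train.py with the @record decorator. A sketch, with main(args) assumed from the tracebacks above rather than copied from the actual file:

from torch.distributed.elastic.multiprocessing.errors import record

@record  # on an uncaught exception, records its type, message and traceback to the error file the elastic agent reads
def main(args):
    ...  # existing training code in train.py stays unchanged

With that in place, the table above would show each rank's OutOfMemoryError instead of the generic pointer.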