
Loss is nan, stopping training, while trying to reproduce alpaca_finetuning_v1 results. #144

Open
NavaneethNidadavolu opened this issue Feb 8, 2024 · 1 comment


@NavaneethNidadavolu

I'm using 2 NVIDIA GeForce RTX 3090 GPUs. (Memory: 24576MiB each)

I'm trying to fine-tune the model with the provided alpaca_data to replicate the results of this paper. I've set the batch size to 2 because 4 gives me this error:

torch.cuda.OutOfMemoryError: CUDA out of memory.

Command:

!OMP_NUM_THREADS=8 torchrun --nproc_per_node 2 finetuning.py \
    --model Llama7B_adapter \
    --llama_model_path /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/ \
    --data_path ./alpaca_data.json \
    --adapter_layer 30 \
    --adapter_len 10 \
    --max_seq_len 512 \
    --batch_size 2 \
    --epochs 5 \
    --warmup_epochs 2 \
    --blr 9e-3 \
    --weight_decay 0.02 \
    --output_dir ./checkpoint/
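
To keep the effective batch size at 8 (what --batch_size 4 across 2 GPUs would have given), I'm also considering gradient accumulation instead of a larger per-GPU batch. The argument dump in the logs below shows an accum_iter option defaulting to 1; assuming it is exposed as --accum_iter (I have not verified that flag name against finetuning.py), the command would become:

!OMP_NUM_THREADS=8 torchrun --nproc_per_node 2 finetuning.py \
    --model Llama7B_adapter \
    --llama_model_path /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/ \
    --data_path ./alpaca_data.json \
    --adapter_layer 30 \
    --adapter_len 10 \
    --max_seq_len 512 \
    --batch_size 2 \
    --accum_iter 2 \
    --epochs 5 \
    --warmup_epochs 2 \
    --blr 9e-3 \
    --weight_decay 0.02 \
    --output_dir ./checkpoint/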

Logs:

| distributed init (rank 0): env://, gpu 0
| distributed init (rank 1): env://, gpu 1
[17:10:32.992362] job dir: /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/alpaca_finetuning_v1
[17:10:32.992448] Namespace(batch_size=2,
epochs=5,
accum_iter=1,
llama_model_path='/home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/',
model='Llama7B_adapter',
adapter_layer=30,
adapter_len=10,
max_seq_len=512,
weight_decay=0.02,
lr=None,
blr=0.009,
min_lr=0.0,
warmup_epochs=2,
data_path='./alpaca_data.json',
output_dir='./checkpoint/',
log_dir='./output_dir',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=10,
pin_mem=True,
world_size=2,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
rank=0,
gpu=0,
distributed=True,
dist_backend='nccl')
[17:10:33.198872] <__main__.InstructionDataset object at 0x7fdcb9bd85e0>
[17:10:33.198914] <__main__.InstructionDataset object at 0x7fdc2dab6da0>
[17:10:33.198952] Sampler_train = <torch.utils.data.distributed.DistributedSampler object at 0x7fdc2dab6d70>
[17:10:39.061337] /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/7B/consolidated.00.pth
/home/navaneeth/.local/lib/python3.10/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
/home/navaneeth/.local/lib/python3.10/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
[17:10:44.010087] Model = Transformer(
  (tok_embeddings): Embedding(32000, 4096)
  (adapter_query): Embedding(300, 4096)
  (criterion): CrossEntropyLoss()
  (layers): ModuleList(
    (0-31): 32 x TransformerBlock(
      (attention): Attention(
        (wq): Linear(in_features=4096, out_features=4096, bias=False)
        (wk): Linear(in_features=4096, out_features=4096, bias=False)
        (wv): Linear(in_features=4096, out_features=4096, bias=False)
        (wo): Linear(in_features=4096, out_features=4096, bias=False)
      )
      (feed_forward): FeedForward(
        (w1): Linear(in_features=4096, out_features=11008, bias=False)
        (w2): Linear(in_features=11008, out_features=4096, bias=False)
        (w3): Linear(in_features=4096, out_features=11008, bias=False)
      )
      (attention_norm): RMSNorm()
      (ffn_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (output): Linear(in_features=4096, out_features=32000, bias=False)
)
[17:10:44.010150] base lr: 9.00e-03
[17:10:44.010164] actual lr: 1.41e-04
[17:10:44.010175] accumulate grad iterations: 1
[17:10:44.010185] effective batch size: 4
[17:10:46.081588] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.000140625
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.000140625
    maximize: False
    weight_decay: 0.02
)
[17:10:46.081681] Start training for 5 epochs
[17:10:46.082621] log_dir: ./output_dir
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[17:10:47.481772] Epoch: [0]  [    0/13000]  eta: 5:03:00  lr: 0.000000  closs: 1.5186 (1.5186)  time: 1.3985  data: 0.3370  max mem: 20596
[17:10:53.607369] Epoch: [0]  [   10/13000]  eta: 2:28:04  lr: 0.000000  closs: 1.3838 (1.5683)  time: 0.6840  data: 0.0307  max mem: 20618
[17:10:59.724177] Epoch: [0]  [   20/13000]  eta: 2:20:30  lr: 0.000000  closs: 1.3838 (1.6151)  time: 0.6121  data: 0.0001  max mem: 20618
[17:11:05.859398] Epoch: [0]  [   30/13000]  eta: 2:17:53  lr: 0.000000  closs: 1.5078 (1.6011)  time: 0.6125  data: 0.0001  max mem: 20618
[17:11:12.016458] Epoch: [0]  [   40/13000]  eta: 2:16:36  lr: 0.000000  closs: 1.5508 (1.6249)  time: 0.6146  data: 0.0001  max mem: 20618
[17:11:18.185472] Epoch: [0]  [   50/13000]  eta: 2:15:50  lr: 0.000000  closs: 1.5430 (1.6123)  time: 0.6163  data: 0.0001  max mem: 20618
[17:11:24.369595] Epoch: [0]  [   60/13000]  eta: 2:15:21  lr: 0.000000  closs: 1.4756 (1.6185)  time: 0.6176  data: 0.0001  max mem: 20618
[17:11:30.565958] Epoch: [0]  [   70/13000]  eta: 2:15:00  lr: 0.000000  closs: 1.4609 (1.5902)  time: 0.6190  data: 0.0001  max mem: 20618
[17:11:36.784317] Epoch: [0]  [   80/13000]  eta: 2:14:46  lr: 0.000000  closs: 1.4434 (1.5902)  time: 0.6207  data: 0.0001  max mem: 20618
[17:11:43.015473] Epoch: [0]  [   90/13000]  eta: 2:14:36  lr: 0.000000  closs: 1.5264 (1.5991)  time: 0.6224  data: 0.0001  max mem: 20618
[17:11:49.250349] Epoch: [0]  [  100/13000]  eta: 2:14:27  lr: 0.000001  closs: 1.5889 (1.6096)  time: 0.6232  data: 0.0001  max mem: 20618
[17:11:55.489798] Epoch: [0]  [  110/13000]  eta: 2:14:19  lr: 0.000001  closs: 1.6016 (1.6066)  time: 0.6237  data: 0.0001  max mem: 20618
[17:12:01.741387] Epoch: [0]  [  120/13000]  eta: 2:14:12  lr: 0.000001  closs: 1.4668 (1.6057)  time: 0.6245  data: 0.0001  max mem: 20618
[17:12:07.992549] Epoch: [0]  [  130/13000]  eta: 2:14:06  lr: 0.000001  closs: 1.5322 (1.6070)  time: 0.6251  data: 0.0001  max mem: 20618
[17:12:14.251213] Epoch: [0]  [  140/13000]  eta: 2:14:00  lr: 0.000001  closs: 1.5322 (1.6222)  time: 0.6254  data: 0.0001  max mem: 20618
[17:12:20.515632] Epoch: [0]  [  150/13000]  eta: 2:13:55  lr: 0.000001  closs: 1.5615 (1.6169)  time: 0.6261  data: 0.0001  max mem: 20618
[17:12:26.784347] Epoch: [0]  [  160/13000]  eta: 2:13:50  lr: 0.000001  closs: 1.3867 (1.6352)  time: 0.6266  data: 0.0001  max mem: 20618
[17:12:33.056327] Epoch: [0]  [  170/13000]  eta: 2:13:45  lr: 0.000001  closs: 1.3955 (1.6270)  time: 0.6270  data: 0.0001  max mem: 20618
[17:12:39.323410] Epoch: [0]  [  180/13000]  eta: 2:13:39  lr: 0.000001  closs: 1.4873 (1.6272)  time: 0.6269  data: 0.0001  max mem: 20618
[17:12:45.595449] Epoch: [0]  [  190/13000]  eta: 2:13:34  lr: 0.000001  closs: 1.3467 (1.6120)  time: 0.6269  data: 0.0001  max mem: 20618
[17:12:51.868360] Epoch: [0]  [  200/13000]  eta: 2:13:29  lr: 0.000001  closs: 1.3467 (1.6073)  time: 0.6272  data: 0.0001  max mem: 20618
[17:12:58.145625] Epoch: [0]  [  210/13000]  eta: 2:13:24  lr: 0.000001  closs: 1.3662 (1.6042)  time: 0.6275  data: 0.0001  max mem: 20618
[17:13:04.422560] Epoch: [0]  [  220/13000]  eta: 2:13:19  lr: 0.000001  closs: 1.4150 (1.5964)  time: 0.6277  data: 0.0001  max mem: 20618
[17:13:10.699249] Epoch: [0]  [  230/13000]  eta: 2:13:13  lr: 0.000001  closs: 1.4043 (1.6040)  time: 0.6276  data: 0.0001  max mem: 20618
[17:13:16.974919] Epoch: [0]  [  240/13000]  eta: 2:13:08  lr: 0.000001  closs: 1.4766 (1.6143)  time: 0.6276  data: 0.0001  max mem: 20618
[17:13:23.254382] Epoch: [0]  [  250/13000]  eta: 2:13:03  lr: 0.000001  closs: 1.6289 (1.6176)  time: 0.6277  data: 0.0001  max mem: 20618
[17:13:29.533807] Epoch: [0]  [  260/13000]  eta: 2:12:57  lr: 0.000001  closs: 1.4199 (1.6184)  time: 0.6279  data: 0.0001  max mem: 20618
[17:13:35.815718] Epoch: [0]  [  270/13000]  eta: 2:12:52  lr: 0.000001  closs: 1.3975 (1.6182)  time: 0.6280  data: 0.0001  max mem: 20618
[17:13:42.094205] Epoch: [0]  [  280/13000]  eta: 2:12:46  lr: 0.000002  closs: 1.5469 (1.6180)  time: 0.6280  data: 0.0001  max mem: 20618
[17:13:48.380554] Epoch: [0]  [  290/13000]  eta: 2:12:41  lr: 0.000002  closs: 1.5586 (1.6233)  time: 0.6282  data: 0.0001  max mem: 20618
[17:13:54.670131] Epoch: [0]  [  300/13000]  eta: 2:12:36  lr: 0.000002  closs: 1.3896 (1.6176)  time: 0.6287  data: 0.0001  max mem: 20618
[17:13:59.988396] Loss is nan, stopping training -> train_one_epoch
[2024-02-07 17:14:05,032] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 130839 closing signal SIGTERM
[2024-02-07 17:14:05,146] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 130838) of binary: /home/navaneeth/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/navaneeth/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetuning.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-07_17:14:05
  host      : sjsu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 130838)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
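
For reference, the "actual lr" printed above is consistent with a linear scaling rule of blr * effective_batch_size / 256 (this is only inferred from the printed numbers, not checked against the script):

blr = 9e-3
effective_batch_size = 2 * 2 * 1   # batch_size * num_gpus * accum_iter
actual_lr = blr * effective_batch_size / 256
print(actual_lr)                   # 0.000140625, matching the AdamW lr in the log

So at the point where training stops (around step 300 of epoch 0, inside the 2-epoch warmup), the learning rate shown in the log is still essentially zero.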

What can I do to solve this?
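
One thing I'm planning to try is wrapping the loss in an explicit check so I can see exactly which step and value trip the guard, roughly like this (a generic PyTorch sketch, not code from this repository):

import math
import torch

# Debug only: makes autograd report the operation that produced a NaN/Inf (noticeably slower).
torch.autograd.set_detect_anomaly(True)

def assert_finite_loss(loss: torch.Tensor, step: int) -> None:
    value = loss.item()  # forces a GPU sync; acceptable for a debugging run
    if not math.isfinite(value):
        raise RuntimeError(f"non-finite loss {value} at step {step}")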

@EmilyGirl

Did you solve it? How did you solve it?
