
fsdp_qlora fail #3907

Closed · 1 task done
etemiz opened this issue May 26, 2024 · 5 comments
Labels: solved (This problem has been already solved)

etemiz commented May 26, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

bash examples/extras/fsdp_qlora/single_node.sh

After reinstalling LLaMA-Factory at the latest commits without changing anything else, I ran the script above, which runs SFT on Llama 3 8B. It didn't work: one of the processes appears to have been killed during validation:
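
For context, the traceback below shows that the script wraps `accelerate launch` in multi-GPU mode around `src/train.py`. Roughly, it does something like the following sketch (the accelerate config and YAML paths here are illustrative assumptions, not copied from the repository):

```bash
# Approximate shape of examples/extras/fsdp_qlora/single_node.sh.
# The FSDP accelerate config and the SFT YAML paths below are assumptions.
CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft.yaml
```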

***** train metrics *****
  epoch                    =     2.9817
  total_flos               = 22902485GF
  train_loss               =     0.9921
  train_runtime            = 1:41:22.36
  train_samples_per_second =      0.484
  train_steps_per_second   =       0.03
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
05/26/2024 11:54:51 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
[INFO|trainer.py:3719] 2024-05-26 11:54:51,665 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-05-26 11:54:51,665 >>   Num examples = 110
[INFO|trainer.py:3724] 2024-05-26 11:54:51,665 >>   Batch size = 1
 51%|█████████                  | 28/55 [00:32<00:31,  1.18s/it]
W0526 11:55:29.603000 140709944127552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 92950 closing signal SIGTERM
E0526 11:55:30.569000 140709944127552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 92949) of binary: /home/dead/Desktop/ml/LLaMA-Factory/v/bin/python
Traceback (most recent call last):
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
src/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_11:55:29
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 92949)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 92949
=======================================================

Expected behavior

Training completes without crashing.

System Info

GPUs: 2*RTX 3090

Others

No response

hiyouga (Owner) commented May 26, 2024

Try disabling evaluation after training?

hiyouga added the pending (This problem is yet to be addressed) label May 26, 2024
etemiz (Author) commented May 26, 2024

How do I do that?

hiyouga (Owner) commented May 26, 2024

Remove the eval args from the YAML config.
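
For reference, the eval-related settings in the example YAML usually look something like the block below (the key names and values are typical examples, not copied from this exact config, and older transformers versions use `evaluation_strategy` instead of `eval_strategy`); removing or commenting out the whole block makes the run train only and skip evaluation.

```yaml
### eval  (delete or comment out this block to disable evaluation)
val_size: 0.1                 # fraction of the dataset held out for eval (illustrative value)
per_device_eval_batch_size: 1
eval_strategy: steps          # `evaluation_strategy` on older transformers versions
eval_steps: 500
```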

etemiz (Author) commented May 26, 2024

The same thing happened this time during training:

{'loss': 1.0392, 'grad_norm': 1.531546711921692, 'learning_rate': 8.078577175829324e-05, 'epoch': 0.88}                                                                                                             
 32%|█████████                  | 65/204 [39:24<1:23:19, 35.97s/it]
W0526 16:36:23.858000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 101092 closing signal SIGTERM
E0526 16:36:24.875000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 101091) of binary: /home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/python
Traceback (most recent call last):
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
src/train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_16:36:23
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 101091)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 101091
========================================================

etemiz (Author) commented May 27, 2024

I ran it again with evals disabled, and this time it completed without an issue.

(Attached: training_loss plot)

hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label May 28, 2024
hiyouga closed this as completed May 28, 2024