
fsdp_qlora fail #3907

Closed · 1 task done
etemiz opened this issue May 26, 2024 · 5 comments
Labels: solved (This problem has been already solved)

etemiz commented May 26, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

bash examples/extras/fsdp_qlora/single_node.sh

After reinstalling LLaMA-Factory at the latest commits without changing anything else, I ran the script above, which runs SFT on Llama 3 8B. It didn't work: one of the processes appears to have been killed during validation:
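
For context, the traceback below shows that the script wraps `accelerate launch` in multi-GPU mode around `src/train.py`. Roughly, it does something like the following sketch (the accelerate config and YAML paths here are illustrative assumptions, not copied from the repository):

```bash
# Approximate shape of examples/extras/fsdp_qlora/single_node.sh.
# The FSDP accelerate config and the SFT YAML paths below are assumptions.
CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft.yaml
```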

***** train metrics *****
  epoch                    =     2.9817
  total_flos               = 22902485GF
  train_loss               =     0.9921
  train_runtime            = 1:41:22.36
  train_samples_per_second =      0.484
  train_steps_per_second   =       0.03
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
05/26/2024 11:54:51 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
[INFO|trainer.py:3719] 2024-05-26 11:54:51,665 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-05-26 11:54:51,665 >>   Num examples = 110
[INFO|trainer.py:3724] 2024-05-26 11:54:51,665 >>   Batch size = 1
 51%|█████████                  | 28/55 [00:32<00:31,  1.18s/it]
W0526 11:55:29.603000 140709944127552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 92950 closing signal SIGTERM
E0526 11:55:30.569000 140709944127552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 92949) of binary: /home/dead/Desktop/ml/LLaMA-Factory/v/bin/python
Traceback (most recent call last):
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
src/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_11:55:29
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 92949)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 92949
=======================================================

Expected behavior

Training completes without crashing.

System Info

GPUs: 2*RTX 3090

Others

No response

hiyouga (Owner) commented May 26, 2024

Try disabling evaluation after training?

hiyouga added the pending (This problem is yet to be addressed) label May 26, 2024
etemiz (Author) commented May 26, 2024

How do I do that?

hiyouga (Owner) commented May 26, 2024

Remove the eval args from the YAML config.
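
For reference, the eval-related settings in the example YAML usually look something like the block below (the key names and values are typical examples, not copied from this exact config, and older transformers versions use `evaluation_strategy` instead of `eval_strategy`); removing or commenting out the whole block makes the run train only and skip evaluation.

```yaml
### eval  (delete or comment out this block to disable evaluation)
val_size: 0.1                 # fraction of the dataset held out for eval (illustrative value)
per_device_eval_batch_size: 1
eval_strategy: steps          # `evaluation_strategy` on older transformers versions
eval_steps: 500
```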

etemiz (Author) commented May 26, 2024

The same thing happened this time during training:

{'loss': 1.0392, 'grad_norm': 1.531546711921692, 'learning_rate': 8.078577175829324e-05, 'epoch': 0.88}                                                                                                             
 32%|█████████                  | 65/204 [39:24<1:23:19, 35.97s/it]
W0526 16:36:23.858000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 101092 closing signal SIGTERM
E0526 16:36:24.875000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 101091) of binary: /home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/python
Traceback (most recent call last):
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
src/train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_16:36:23
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 101091)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 101091
========================================================

etemiz (Author) commented May 27, 2024

I ran it again with evals disabled, and this time it completed without an issue.

(Attached: training_loss plot)

hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label May 28, 2024
hiyouga closed this as completed May 28, 2024