RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f2f4244a0e0>
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11971 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11970) of binary: /root/miniconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama_factory/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/train_bash.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2024-01-16_14:56:20
  host      : 4d4cbca02479
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11970)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
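In this log, the `AttributeError` raised from `DeepSpeedCPUAdam.__del__` is most likely a secondary symptom rather than a separate bug. The likely sequence is that the `cpu_adam` extension build fails inside the constructor before `ds_opt_adam` is assigned. When the half-built object is later garbage-collected, its finalizer reads the missing attribute, and Python reports that error as "Exception ignored in" instead of raising it. The sketch below reproduces that pattern in plain Python. It assumes the constructor binds `ds_opt_adam` from the build step. `FakeCPUAdam` and `load_extension` are hypothetical stand-ins, not DeepSpeed's actual code.

```python
import gc
import sys


def load_extension():
    """Hypothetical loader that fails the way the 'cpu_adam' JIT build did."""
    raise RuntimeError("Error building extension 'cpu_adam'")


class FakeCPUAdam:
    """Hypothetical stand-in for DeepSpeedCPUAdam; not DeepSpeed's real code."""

    def __init__(self):
        # The build error propagates out of __init__, so this attribute is
        # never bound on the half-constructed instance.
        self.ds_opt_adam = load_extension()

    def __del__(self):
        # Finalizer that assumes __init__ succeeded. Reading the missing
        # attribute raises AttributeError, which Python cannot propagate
        # out of __del__ and instead reports as "Exception ignored in".
        _ = self.ds_opt_adam


def reproduce():
    """Return the constructor's error message and any finalizer errors."""
    finalizer_errors = []
    previous_hook = sys.unraisablehook
    # Capture what would otherwise be printed as "Exception ignored in: ...".
    sys.unraisablehook = lambda args: finalizer_errors.append(
        (args.exc_type, str(args.exc_value))
    )
    primary = None
    try:
        try:
            FakeCPUAdam()
        except RuntimeError as exc:
            primary = str(exc)
        # After the exception (and the traceback frame holding `self`) is
        # released, the instance is finalized; collect() covers any cycles.
        gc.collect()
    finally:
        sys.unraisablehook = previous_hook
    return primary, finalizer_errors


if __name__ == "__main__":
    message, errors = reproduce()
    print("primary:", message)
    for exc_type, text in errors:
        print("from __del__:", exc_type.__name__, text)
```

In the sketch, the `RuntimeError` is the primary failure and the `AttributeError` only surfaces through the finalizer. This suggests the `cpu_adam` build error is the one to investigate, and the `__del__` message would likely disappear once the build succeeds.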