it seems there is a version problem please help me #333

Open
K-Alex13 opened this issue Jan 16, 2024 · 0 comments

RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f2f4244a0e0>
Traceback (most recent call last):
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11971 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11970) of binary: /root/miniconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/llama_factory/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
deepspeed_launcher(args)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
distrib_run.run(args)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/train_bash.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-01-16_14:56:20
host : 4d4cbca02479
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 11970)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
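
Since the title suspects a version mismatch, here is a minimal sketch for collecting the installed versions of the packages that appear in the traceback above (torch, deepspeed, accelerate). The helper name `installed_versions` is hypothetical and not part of any of these libraries; it only reads package metadata and does not import the packages, so it should run even when the `cpu_adam` extension fails to build.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}.

    Hypothetical helper for illustration; it reads installed package
    metadata only and never imports the packages themselves.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the paths in the traceback.
    print("python", platform.python_version())
    for name, version in installed_versions(["torch", "deepspeed", "accelerate"]).items():
        print(name, version or "not installed")
```

Running this inside the `llama_factory` environment and posting the output alongside the log would make it easier to tell whether the versions are compatible. DeepSpeed's own `ds_report` command prints a similar environment summary.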
