Ascend 910B: full-parameter fine-tuning of ChatGLM2 reports an error #3788
Comments
[INFO|modeling_utils.py:4170] 2024-05-20 17:25:15,119 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:4178] 2024-05-20 17:25:15,119 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /root/.cache/modelscope/hub/ZhipuAI/chatglm3-6b.
@hunterhome ChatGLM uses torch.jit, which torch-npu does not support; you can comment out the corresponding torch.jit decorators.
cc @belle9217
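For anyone hitting the same thing, a minimal sketch of the workaround described above, assuming the stock chatglm3-6b modeling_chatglm.py (the decorated function name, and whether other functions also carry @torch.jit.script, may differ in your copy):

```python
# Option 1: in modeling_chatglm.py, comment out the TorchScript decorator
# (assumed here to sit on apply_rotary_pos_emb; check your copy for other sites):
#
#   # @torch.jit.script   # torch.jit is unsupported by torch-npu
#   def apply_rotary_pos_emb(x, rope_cache):
#       ...
#
# Option 2: disable TorchScript globally instead of editing the model file.
# PyTorch reads PYTORCH_JIT at import time; setting it to 0 turns
# @torch.jit.script into a no-op.
import os
os.environ["PYTORCH_JIT"] = "0"  # must run before `import torch`
import torch
```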
To add a bit more information: don't, at the point of the error …
Thank you!
Currently, about 25% of the way through the run, the following error is reported:
Traceback (most recent call last):
File "/data/LLaMA-Factory/src/llamafactory/launcher.py", line 9, in <module>
launch()
File "/data/LLaMA-Factory/src/llamafactory/launcher.py", line 5, in launch
run_exp()
File "/data/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/data/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
loss = self.compute_loss(model, inputs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3264, in compute_loss
outputs = model(**inputs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
return model_forward(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
return self.base_model(
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
return self.model.forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 941, in forward
transformer_outputs = self.transformer(
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 834, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 631, in forward
layer_ret = torch.utils.checkpoint.checkpoint(
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
ret = function(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
attention_output, kv_cache = self.self_attention(
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 408, in forward
query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 169, in apply_rotary_pos_emb
rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
RuntimeError: shape '[13024, -1, 1, 32, 2]' is invalid for input of size 524288
[2024-05-31 13:24:51,107] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345687 closing signal SIGTERM
[2024-05-31 13:24:51,107] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345688 closing signal SIGTERM
[2024-05-31 13:24:51,108] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345690 closing signal SIGTERM
[2024-05-31 13:24:51,110] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345691 closing signal SIGTERM
[2024-05-31 13:24:51,114] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345692 closing signal SIGTERM
[2024-05-31 13:24:51,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345693 closing signal SIGTERM
[2024-05-31 13:24:51,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345694 closing signal SIGTERM
Exception in thread Thread-2:
Traceback (most recent call last):
File "/data/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run
key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT)
File "<string>", line 2, in get
File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod
kind, result = conn.recv()
File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
[the EOFError traceback above repeats 6 more times, once per remaining worker thread]
[2024-05-31 13:24:58,118] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[the same ds_accelerator message repeats 55 more times as each spawned process initializes]
[2024-05-31 13:25:21,117] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345687 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:21,409] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345688 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:21,743] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345690 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,067] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345691 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,353] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345692 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,754] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345693 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:23,183] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345694 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:23,720] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 1345689) of binary: /data/anaconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
File "/data/anaconda3/envs/llama_factory/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/data/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-31_13:24:51
host : localhost.localdomain
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1345689)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
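For what it's worth, the numbers in the RuntimeError above line up with the input sequence outrunning the rotary-embedding cache: 524288 = 8192 × 32 × 2, so rope_cache covers 8192 positions (presumably the model's configured seq_length), while the batch is sq = 13024 tokens long. A standalone sketch of the failing view, with shapes read off the error message:

```python
import torch

# Shapes assumed from the error message: the rope cache flattens to
# 524288 = 8192 * 32 * 2 elements (8192 positions, 32 rotary pairs, cos/sin),
# while the current batch contains 13024 tokens.
rope_cache = torch.zeros(8192, 32, 2)
sq = 13024

# 13024 * 1 * 32 * 2 = 833536 > 524288, so the inferred -1 dimension cannot fit:
rope_cache.view(sq, -1, 1, 32, 2)
# RuntimeError: shape '[13024, -1, 1, 32, 2]' is invalid for input of size 524288
```

If that reading is right, capping the tokenized length (e.g. LLaMA-Factory's cutoff_len) at or below the model's seq_length should avoid this particular reshape failure.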
Closed #3788 as completed.
Thank you!
Reminder
Reproduction
The bug is as shown in the screenshot below.
Expected behavior
No response
System Info
No response
Others