Description
Training command:
MASTER_PORT=29607 \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /date0/crwu/models/Qwen3-VL-235B-A22B-Instruct/ \
    --dataset /date0/crwu/test_data/info/10K_images_data/extracted_food_images_train_easy_prompt_warm.json \
    --val_dataset /date0/crwu/test_data/info/10K_images_data/extracted_food_images_val_9000pics_easy_prompt.json \
    --packing true \
    --attn_impl flash_attn \
    --model_type qwen3_moe_vl \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --learning_rate 5e-6 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --gradient_accumulation_steps 1 \
    --eval_steps 25 \
    --save_steps 25 \
    --save_total_limit 3 \
    --logging_steps 1 \
    --max_length 4096 \
    --dataloader_drop_last true \
    --output_dir output/20250926153124_10K_images_data_train_50_lora8_alpha16_lr5e-6_dropout_0.1_Lora_target_modules_TEST \
    --warmup_steps 100 \
    --dataloader_num_workers 32 \
    --dataset_num_proc 128 \
    --save_only_model true \
    --use_liger_kernel true \
    --swanlab_mode cloud \
    --early_stop_interval 6 \
    --deepspeed zero3 \
    --max_steps 10 \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}'
Error:
This error started appearing after I switched to deepspeed 0.16.9.
W0926 19:27:27.271000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1837808 closing signal SIGTERM
W0926 19:27:27.273000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1837809 closing signal SIGTERM
W0926 19:27:27.279000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1837810 closing signal SIGTERM
W0926 19:27:27.291000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1837811 closing signal SIGTERM
W0926 19:27:27.294000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1837813 closing signal SIGTERM
W0926 19:27:27.303000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1837814 closing signal SIGTERM
W0926 19:27:27.308000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1837815 closing signal SIGTERM
E0926 19:27:34.475000 1837730 miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 4 (pid: 1837812) of binary: /date0/crwu/miniconda3/envs/ms-swift-qwen3/bin/python3.10
Traceback (most recent call last):
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/date0/crwu/miniconda3/envs/ms-swift-qwen3/lib/python3.10/site-packages/swift/cli/sft.py FAILED
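Note that the traceback above comes from the torchrun launcher, so it only says that a child process failed; the actual exception should be in the output of the failing rank (local_rank 4, pid 1837812 here). As a rough sketch for locating that rank in a long log, the helper below pulls the exit code, local rank, and pid out of the launcher's "failed (exitcode: ...)" line. The function name `find_failed_ranks` and the regular expression are illustrative assumptions based only on the log format shown above, not part of torch or ms-swift.

```python
import re

# Matches the torch elastic failure line as it appears in the log above,
# e.g. "failed (exitcode: 1) local_rank: 4 (pid: 1837812) of binary: ...".
# The pattern is an assumption derived from this one log, not an official format.
FAILURE_RE = re.compile(
    r"failed \(exitcode: (?P<exitcode>-?\d+)\) "
    r"local_rank: (?P<local_rank>\d+) \(pid: (?P<pid>\d+)\)"
)


def find_failed_ranks(log_text):
    """Return one dict (exitcode, local_rank, pid) per failure line found."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in FAILURE_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "E0926 19:27:34.475000 1837730 miniconda3/envs/ms-swift-qwen3/lib/"
        "python3.10/site-packages/torch/distributed/elastic/multiprocessing/"
        "api.py:869] failed (exitcode: 1) local_rank: 4 (pid: 1837812) of "
        "binary: /date0/crwu/miniconda3/envs/ms-swift-qwen3/bin/python3.10"
    )
    print(find_failed_ranks(sample))
```

With the failing rank identified, the lines printed by that process before the SIGTERM messages are the place to look for the real error.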
Environment dependencies:
pip list
Package Version
absl-py 2.3.1
accelerate 1.10.1
addict 2.4.0
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aiosignal 1.4.0
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.11.0
asttokens 3.0.0
async-timeout 5.0.1
attrdict 2.0.1
attrs 25.3.0
av 15.1.0
binpacking 1.5.2
boto3 1.40.38
botocore 1.40.38
Brotli 1.1.0
cachetools 5.5.2
certifi 2025.8.3
cffi 2.0.0
charset-normalizer 3.4.3
click 8.3.0
colorama 0.4.6
colorlog 6.9.0
contourpy 1.3.2
cpm-kernels 1.0.11
crcmod 1.7
cryptography 46.0.1
cycler 0.12.1
dacite 1.9.2
datasets 3.6.0
decorator 5.2.1
decord 0.6.0
deepspeed 0.16.9
dill 0.3.8
distro 1.9.0
docstring_parser 0.17.0
dotenv 0.9.9
einops 0.8.1
et_xmlfile 2.0.0
evalscope 1.0.2
evaluate 0.4.6
exceptiongroup 1.3.0
executing 2.2.1
fastapi 0.117.1
ffmpy 0.6.1
filelock 3.13.1
fire 0.7.1
flash_attn 2.8.3
fonttools 4.60.0
frozenlist 1.7.0
fsspec 2024.6.1
func_timeout 4.3.5
future 1.0.0
fuzzywuzzy 0.18.0
google-auth 2.40.3
google-genai 1.38.0
gradio 5.38.2
gradio_client 1.11.0
groovy 0.1.2
grpcio 1.75.0
h11 0.16.0
h5py 3.14.0
hf-xet 1.1.10
hjson 3.1.0
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 1.0.0rc1
human-eval 1.0.3
idna 3.10
imageio 2.37.0
immutabledict 4.2.1
importlib_metadata 8.7.0
ipdb 0.13.13
ipython 8.37.0
jedi 0.19.2
jieba 0.42.1
Jinja2 3.1.4
jiter 0.11.0
jmespath 0.10.0
joblib 1.5.2
json_repair 0.51.0
json5 0.12.1
jsonlines 4.0.0
kiwisolver 1.4.9
langdetect 1.0.9
latex2sympy2_extended 1.10.2
Levenshtein 0.27.1
liger_kernel 0.6.2
lxml 6.0.2
Markdown 3.9
markdown-it-py 4.0.0
MarkupSafe 2.1.5
matplotlib 3.10.6
matplotlib-inline 0.1.7
mdurl 0.1.2
mmengine-lite 0.10.7
modelscope 1.30.0
mpmath 1.3.0
ms-opencompass 0.1.6
ms_swift 3.9.0.dev0
ms-vlmeval 0.0.18
msgpack 1.1.1
multidict 6.6.4
multiprocess 0.70.16
networkx 3.3
ninja 1.13.0
nltk 3.9.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 13.580.82
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvtx-cu12 12.1.105
openai 1.109.1
OpenCC 1.1.9
opencv-python 4.11.0.86
openpyxl 3.1.5
orjson 3.11.3
oss2 2.19.1
overrides 7.7.0
packaging 25.0
pandas 2.3.2
parso 0.8.5
peft 0.17.1
pexpect 4.9.0
pillow 11.0.0
pip 25.2
platformdirs 4.4.0
portalocker 3.2.0
prettytable 3.16.0
prompt_toolkit 3.0.52
propcache 0.3.2
protobuf 6.32.1
psutil 7.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyarrow 21.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.2
pycparser 2.23
pycryptodome 3.23.0
pydantic 2.11.9
pydantic_core 2.33.2
pydub 0.25.1
pyecharts 2.0.8
Pygments 2.19.2
pynvml 13.0.1
pyparsing 3.2.5
pypinyin 0.55.0
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-Levenshtein 0.27.1
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
qwen-vl-utils 0.0.14
rank-bm25 0.2.2
RapidFuzz 3.14.1
regex 2025.9.18
requests 2.32.5
rich 13.9.4
rouge 1.0.1
rouge-chinese 1.0.3
rouge_score 0.1.2
rsa 4.9.1
ruff 0.13.2
s3transfer 0.14.0
sacrebleu 2.5.1
safehttpx 0.1.6
safetensors 0.6.2
scikit-learn 1.7.2
scipy 1.15.3
seaborn 0.13.2
semantic-version 2.10.0
sentence-transformers 5.1.1
sentencepiece 0.2.1
setuptools 78.1.1
shellingham 1.5.4
simplejson 3.20.1
six 1.17.0
sniffio 1.3.1
sortedcontainers 2.4.0
stack-data 0.6.3
starlette 0.48.0
sty 1.0.6
swankit 0.2.4
swanlab 0.6.10
sympy 1.13.1
tabulate 0.9.0
tenacity 9.1.2
tensorboard 2.20.0
tensorboard-data-server 0.7.2
termcolor 3.1.0
threadpoolctl 3.6.0
tiktoken 0.11.0
timeout-decorator 0.5.0
timm 1.0.20
tokenizers 0.22.1
tomli 2.2.1
tomlkit 0.13.3
torch 2.5.1+cu121
torchaudio 2.5.1+cu121
torchvision 0.20.1+cu121
tqdm 4.67.1
traitlets 5.14.3
transformers 4.57.0.dev0
transformers-stream-generator 0.0.5
triton 3.1.0
trl 0.20.0
typer 0.19.2
typer-slim 0.19.2
typing_extensions 4.15.0
typing-inspection 0.4.1
tzdata 2025.2
urllib3 2.5.0
uvicorn 0.37.0
validators 0.35.0
wcwidth 0.2.14
websockets 15.0.1
Werkzeug 3.1.3
wheel 0.45.1
word2number 1.1
wrapt 1.17.3
xlsxwriter 3.2.9
xxhash 3.5.0
yapf 0.43.0
yarl 1.20.1
zipp 3.23.0
zstandard 0.25.0
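Since the full list is long, a shorter summary of the packages most likely to matter for this error may be easier to compare across environments. The sketch below uses the standard library's `importlib.metadata` to collect versions; the selection in `KEY_PACKAGES` and the helper name `collect_versions` are my own assumptions about what is relevant, not something required by ms-swift.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical shortlist of packages relevant to a DeepSpeed/LoRA training run.
KEY_PACKAGES = [
    "torch",
    "deepspeed",
    "transformers",
    "ms_swift",
    "peft",
    "accelerate",
    "flash_attn",
    "liger_kernel",
]


def collect_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, ver in collect_versions(KEY_PACKAGES).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running it in the same conda environment as the training job keeps the reported versions consistent with the `pip list` output above.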