[BUG] Model hangs at trainer.train() and never starts training #5655

Closed
limllzu opened this issue Jun 13, 2024 · 0 comments
Labels: bug, training


limllzu commented Jun 13, 2024

Describe the bug
The dataset loads without any problems, but the model hangs indefinitely at trainer.train() in finetune.py and never starts training.

Package environment:

Name Version Build Channel

_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
absl-py 2.1.0 pypi_0 pypi
accelerate 0.30.1 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiofiles 23.2.1 pypi_0 pypi
altair 5.3.0 pypi_0 pypi
annotated-types 0.7.0 pypi_0 pypi
anyio 4.4.0 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
binutils_impl_linux-64 2.36.1 h193b22a_2 conda-forge
binutils_linux-64 2.36 hf3e587d_10 conda-forge
bitsandbytes-cuda114 0.26.0.post2 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
blinker 1.8.2 pypi_0 pypi
blis 0.7.11 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
ca-certificates 2024.6.2 hbcca054_0 conda-forge
cachetools 5.3.3 pypi_0 pypi
catalogue 2.0.10 pypi_0 pypi
certifi 2024.2.2 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
click 8.1.7 pypi_0 pypi
cloudpathlib 0.16.0 pypi_0 pypi
cmake 3.25.0 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
confection 0.1.5 pypi_0 pypi
contourpy 1.2.1 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
cymem 2.0.8 pypi_0 pypi
deepspeed 0.14.4+eda5075 pypi_0 pypi
editdistance 0.6.2 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
et-xmlfile 1.1.0 pypi_0 pypi
exceptiongroup 1.2.1 pypi_0 pypi
fairscale 0.4.0 pypi_0 pypi
fastapi 0.110.3 pypi_0 pypi
ffmpy 0.3.2 pypi_0 pypi
filelock 3.14.0 pypi_0 pypi
flask 3.0.3 pypi_0 pypi
fonttools 4.53.0 pypi_0 pypi
fsspec 2024.5.0 pypi_0 pypi
gcc_impl_linux-64 11.2.0 h82a94d6_16 conda-forge
gcc_linux-64 11.2.0 h39a9532_10 conda-forge
gpustat 1.1.1 pypi_0 pypi
gradio 4.26.0 pypi_0 pypi
gradio-client 0.15.1 pypi_0 pypi
grpcio 1.64.1 pypi_0 pypi
gxx_impl_linux-64 11.2.0 h82a94d6_16 conda-forge
gxx_linux-64 11.2.0 hacbe6df_10 conda-forge
h11 0.14.0 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
httpcore 1.0.5 pypi_0 pypi
httpx 0.27.0 pypi_0 pypi
huggingface-hub 0.23.2 pypi_0 pypi
idna 3.7 pypi_0 pypi
importlib-resources 6.4.0 pypi_0 pypi
install 1.3.5 pypi_0 pypi
itsdangerous 2.2.0 pypi_0 pypi
jinja2 3.1.4 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
jsonlines 4.0.0 pypi_0 pypi
jsonschema 4.22.0 pypi_0 pypi
jsonschema-specifications 2023.12.1 pypi_0 pypi
kernel-headers_linux-64 2.6.32 he073ed8_17 conda-forge
kiwisolver 1.4.5 pypi_0 pypi
langcodes 3.4.0 pypi_0 pypi
language-data 1.2.0 pypi_0 pypi
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libaio 0.9.3 pypi_0 pypi
libffi 3.4.4 h6a678d5_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-devel_linux-64 11.2.0 h0952999_16 conda-forge
libgcc-ng 13.2.0 h77fa898_7 conda-forge
libgomp 13.2.0 h77fa898_7 conda-forge
libsanitizer 11.2.0 he4da1e4_16 conda-forge
libstdcxx-devel_linux-64 11.2.0 h0952999_16 conda-forge
libstdcxx-ng 13.2.0 hc0a3c3a_7 conda-forge
libuuid 1.41.5 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit 15.0.7 pypi_0 pypi
lxml 5.2.2 pypi_0 pypi
marisa-trie 1.1.1 pypi_0 pypi
markdown 3.6 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markdown2 2.4.10 pypi_0 pypi
markupsafe 2.1.5 pypi_0 pypi
matplotlib 3.7.4 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
more-itertools 10.1.0 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
murmurhash 1.0.10 pypi_0 pypi
ncurses 6.4 h6a678d5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx 3.3 pypi_0 pypi
ninja 1.10.0 pypi_0 pypi
ninja-base 1.10.2 hd09550d_5 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nltk 3.8.1 pypi_0 pypi
numpy 1.24.4 pypi_0 pypi
nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
nvidia-ml-py 12.535.161 pypi_0 pypi
nvidia-nccl-cu12 2.18.1 pypi_0 pypi
nvidia-nvjitlink-cu12 12.5.40 pypi_0 pypi
nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
nvitop 1.3.2 pypi_0 pypi
opencv-python-headless 4.5.5.64 pypi_0 pypi
openpyxl 3.1.2 pypi_0 pypi
openssl 3.3.1 h4ab18f5_0 conda-forge
orjson 3.10.3 pypi_0 pypi
packaging 23.2 pypi_0 pypi
pandas 2.2.2 pypi_0 pypi
peft 0.11.1 pypi_0 pypi
pillow 10.1.0 pypi_0 pypi
pip 24.0 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
portalocker 2.8.2 pypi_0 pypi
preshed 3.0.9 pypi_0 pypi
protobuf 4.25.0 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pydantic 2.7.2 pypi_0 pypi
pydantic-core 2.18.3 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pygments 2.18.0 pypi_0 pypi
pynvml 11.5.0 pypi_0 pypi
pyparsing 3.1.2 pypi_0 pypi
pyproject 1.3.1 pypi_0 pypi
python 3.10.14 h955ad1f_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil 2.9.0.post0 pypi_0 pypi
python-multipart 0.0.9 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readline 8.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
referencing 0.35.1 pypi_0 pypi
regex 2024.5.15 pypi_0 pypi
requests 2.32.3 pypi_0 pypi
rich 13.7.1 pypi_0 pypi
rpds-py 0.18.1 pypi_0 pypi
ruff 0.4.7 pypi_0 pypi
sacrebleu 2.3.2 pypi_0 pypi
safetensors 0.4.3 pypi_0 pypi
seaborn 0.13.0 pypi_0 pypi
semantic-version 2.10.0 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 69.5.1 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
shellingham 1.5.4 pypi_0 pypi
shortuuid 1.0.11 pypi_0 pypi
six 1.16.0 pypi_0 pypi
smart-open 6.4.0 pypi_0 pypi
sniffio 1.3.1 pypi_0 pypi
socksio 1.0.0 pypi_0 pypi
spacy 3.7.2 pypi_0 pypi
spacy-legacy 3.0.12 pypi_0 pypi
spacy-loggers 1.0.5 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
srsly 2.4.8 pypi_0 pypi
starlette 0.37.2 pypi_0 pypi
sympy 1.12.1 pypi_0 pypi
sysroot_linux-64 2.12 he073ed8_17 conda-forge
tabulate 0.9.0 pypi_0 pypi
tensorboard 2.16.2 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
tensorboardx 1.8 pypi_0 pypi
termcolor 2.4.0 pypi_0 pypi
thinc 8.2.3 pypi_0 pypi
timm 0.9.10 pypi_0 pypi
tk 8.6.14 h39e8969_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tokenizers 0.19.1 pypi_0 pypi
tomlkit 0.12.0 pypi_0 pypi
toolz 0.12.1 pypi_0 pypi
torch 2.1.2+cu118 pypi_0 pypi
torchaudio 2.1.2+cu118 pypi_0 pypi
torchvision 0.16.2+cu118 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
transformers 4.40.0 pypi_0 pypi
triton 2.1.0 pypi_0 pypi
typer 0.9.4 pypi_0 pypi
typing-extensions 4.8.0 pypi_0 pypi
tzdata 2024.1 pypi_0 pypi
urllib3 2.2.1 pypi_0 pypi
uvicorn 0.24.0.post1 pypi_0 pypi
wasabi 1.1.3 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
weasel 0.3.4 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
werkzeug 3.0.3 pypi_0 pypi
wheel 0.43.0 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xz 5.4.6 h5eee18b_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
zlib 1.2.13 h5eee18b_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

ds_report:
[2024-06-13 11:43:07,921] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-06-13 11:43:07,982] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

deepspeed_not_implemented [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
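
The report above shows DeepSpeed falling back to the CPU accelerator. That may only mean `ds_report` was run in a shell that cannot see a GPU (for example a login node), not that the training job itself ran on CPU. A minimal sketch for checking what the current shell can see, using only the standard library. The helper name `gpu_visibility` is hypothetical, and it assumes `nvidia-smi -L` is the right probe on this machine:

```python
import os
import shutil
import subprocess


def gpu_visibility():
    """Describe which GPU tooling is visible from the current shell."""
    info = {
        # An empty string here hides every GPU from CUDA programs.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "gpus": [],
    }
    if info["nvidia_smi_path"] is None:
        return info
    try:
        result = subprocess.run(
            [info["nvidia_smi_path"], "-L"],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return info
    if result.returncode == 0:
        # `nvidia-smi -L` prints one "GPU <index>: <name> ..." line per device.
        info["gpus"] = [
            line.strip() for line in result.stdout.splitlines()
            if line.startswith("GPU ")
        ]
    return info


if __name__ == "__main__":
    print(gpu_visibility())
```

Running this in the same shell (or job script) that launches finetune.py would show whether the CPU fallback in `ds_report` also applies to the training run.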

Output:
prepare trainer
<class 'trainer.CPMTrainer'>
trainer ok

Error output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
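
The log stops right after the `PyTorch extensions root` lines, which is the point where PyTorch JIT-builds or loads extension ops. One commonly reported cause of a hang there is a leftover `lock` file from an earlier, interrupted build; this is an assumption, since nothing in the log confirms it. A minimal sketch for listing such files under the extensions root; the helper name `find_lock_files` is hypothetical:

```python
import time
from pathlib import Path


def find_lock_files(root):
    """Return (path, age_in_seconds) for every file named 'lock' under root."""
    root = Path(root).expanduser()
    found = []
    if not root.is_dir():
        return found
    now = time.time()
    for path in root.rglob("lock"):
        if path.is_file():
            found.append((path, now - path.stat().st_mtime))
    return found


if __name__ == "__main__":
    # Path copied from the log above.
    ext_root = "/public/home/lzu2/.cache/torch_extensions/py310_cu118"
    for path, age in find_lock_files(ext_root):
        print(f"{path}  (last modified {age / 60:.0f} min ago)")
```

If a lock file is much older than the current run and no build is in progress, that would be consistent with the stale-lock explanation, though it does not rule out other causes such as the kernel version warning above.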

Code:
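# Build the trainer from the model, tokenizer, training arguments and dataset.
print("prepare trainer")
trainer = CPMTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)
print(type(trainer))
print("trainer ok")

# Execution never gets past this call.
trainer.train()
trainer.save_state()
print("trainer success")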
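
To see where inside `trainer.train()` each process is stuck, one option is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. A minimal sketch; the helper names `watch_for_hang` and `stop_watching` and the 300-second interval are illustrative choices, not part of the original script:

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=300, stream=None):
    """Dump all thread stacks every `timeout_s` seconds until cancelled."""
    target = stream if stream is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def stop_watching():
    """Cancel the periodic stack dump."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    watch_for_hang(timeout_s=300)
    try:
        # trainer.train() would go here.
        pass
    finally:
        stop_watching()
```

With several ranks running, each process writes its own dump, so the traces also show whether all ranks stall at the same call.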
@limllzu added the bug and training labels on Jun 13, 2024
@limllzu closed this as not planned on Jun 13, 2024