Open
Description
Describe the bug
I am trying to fine-tune Qwen2.5-VL on 16 × 80 GB GPUs using LLaMA-Factory with preprocessing_num_workers=16. However, I hit the following error and the program seems to have crashed. The error appears to come from the datasets library.
The error log looks like this:
Converting format of dataset (num_proc=16): 100%|█████████▉| 19265/19267 [11:44<00:00, 5.88 examples/s]
Converting format of dataset (num_proc=16): 100%|█████████▉| 19266/19267 [11:44<00:00, 5.02 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 19267/19267 [11:44<00:00, 5.44 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 19267/19267 [11:44<00:00, 27.34 examples/s]
Running tokenizer on dataset (num_proc=16): 0%| | 0/19267 [00:00<?, ? examples/s]
Invalid NAL unit size (45405 > 35540).
Invalid NAL unit size (86720 > 54856).
Invalid NAL unit size (7131 > 3225).
missing picture in access unit with size 54860
Invalid NAL unit size (48042 > 33645).
missing picture in access unit with size 3229
missing picture in access unit with size 33649
Invalid NAL unit size (86720 > 54856).
Invalid NAL unit size (48042 > 33645).
Error splitting the input into NAL units.
missing picture in access unit with size 35544
Invalid NAL unit size (45405 > 35540).
Error splitting the input into NAL units.
Error splitting the input into NAL units.
Invalid NAL unit size (8187 > 7069).
missing picture in access unit with size 7073
Invalid NAL unit size (8187 > 7069).
Error splitting the input into NAL units.
Invalid NAL unit size (7131 > 3225).
Error splitting the input into NAL units.
Invalid NAL unit size (14013 > 5998).
missing picture in access unit with size 6002
Invalid NAL unit size (14013 > 5998).
Error splitting the input into NAL units.
Invalid NAL unit size (17173 > 7231).
missing picture in access unit with size 7235
Invalid NAL unit size (17173 > 7231).
Error splitting the input into NAL units.
Invalid NAL unit size (16964 > 6055).
missing picture in access unit with size 6059
Invalid NAL unit size (16964 > 6055).
Exception in thread Thread-9 (accepter)Error splitting the input into NAL units.
:
Traceback (most recent call last):
File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Running tokenizer on dataset (num_proc=16): 0%| | 0/19267 [13:22<?, ? examples/s] self.run()
File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 953, in run
Invalid NAL unit size (7032 > 2927).
missing picture in access unit with size 2931
self._target(*self._args, **self._kwargs)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/multiprocess/managers.py", line 194, in accepter
Invalid NAL unit size (7032 > 2927).
Error splitting the input into NAL units.
t.start()
File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 935, in start
Invalid NAL unit size (28973 > 6121).
missing picture in access unit with size 6125
_start_new_thread(self._bootstrap, ())Invalid NAL unit size (28973 > 6121).
RuntimeError: can't start new threadError splitting the input into NAL units.
Invalid NAL unit size (4411 > 296).
missing picture in access unit with size 300
Invalid NAL unit size (4411 > 296).
Error splitting the input into NAL units.
Invalid NAL unit size (14414 > 1471).
missing picture in access unit with size 1475
Invalid NAL unit size (14414 > 1471).
Error splitting the input into NAL units.
Invalid NAL unit size (5283 > 1792).
missing picture in access unit with size 1796
Invalid NAL unit size (5283 > 1792).
Error splitting the input into NAL units.
Invalid NAL unit size (79147 > 10042).
missing picture in access unit with size 10046
Invalid NAL unit size (79147 > 10042).
Error splitting the input into NAL units.
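The `RuntimeError: can't start new thread` in the traceback is raised by Python when the operating system refuses to create another thread, which commonly happens when a per-user process/thread limit or a container PID limit is reached. The following is a minimal diagnostic sketch, assuming a Linux host; `thread_limit_report` is a hypothetical helper name and is not part of LLaMA-Factory or datasets. It only reads limits and does not change anything.

```python
import os
import resource
import threading


def thread_limit_report():
    """Collect limits that can cause "can't start new thread" (hypothetical helper)."""
    # RLIMIT_NPROC caps the number of processes/threads for the current user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    report = {
        "rlimit_nproc_soft": soft,  # resource.RLIM_INFINITY means unlimited
        "rlimit_nproc_hard": hard,
        "active_python_threads": threading.active_count(),
        "cpu_count": os.cpu_count(),
    }
    # Inside containers, a cgroup PID limit may apply as well.
    # These paths are common locations (cgroup v2, then v1) and may not exist.
    for path in ("/sys/fs/cgroup/pids.max", "/sys/fs/cgroup/pids/pids.max"):
        try:
            with open(path) as f:
                report["cgroup_pids_max"] = f.read().strip()
            break
        except OSError:
            continue
    return report


if __name__ == "__main__":
    for key, value in thread_limit_report().items():
        print(f"{key}: {value}")
```

If the reported limits are low relative to 16 worker processes that each spawn decoding threads, trying a smaller preprocessing_num_workers is one way to check whether the limit is the cause.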
Others
No response
Steps to reproduce the bug
None
Expected behavior
I expect preprocessing to run successfully.
Environment info
transformers==4.49.0
datasets==3.2.0
accelerate==1.2.1
peft==0.12.0
trl==0.9.6
tokenizers==0.21.0
gradio>=4.38.0,<=5.18.0
pandas>=2.0.0
scipy
einops
sentencepiece
tiktoken
protobuf
uvicorn
pydantic
fastapi
sse-starlette
matplotlib>=3.7.0
fire
packaging
pyyaml
numpy<2.0.0
av
librosa
tyro<0.9.0
openlm-hub
qwen-vl-utils