Error splitting the input into NAL units. #7427

@MengHao666

Describe the bug

I am trying to fine-tune Qwen2.5-VL on 16 × 80 GB GPUs using LLaMA-Factory with preprocessing_num_workers=16. However, I hit the following errors and the program seems to have crashed. The error appears to come from the datasets library.

The error log is as follows:

Converting format of dataset (num_proc=16): 100%|█████████▉| 19265/19267 [11:44<00:00,  5.88 examples/s]
Converting format of dataset (num_proc=16): 100%|█████████▉| 19266/19267 [11:44<00:00,  5.02 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 19267/19267 [11:44<00:00,  5.44 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 19267/19267 [11:44<00:00, 27.34 examples/s]

Running tokenizer on dataset (num_proc=16):   0%|          | 0/19267 [00:00<?, ? examples/s]
Invalid NAL unit size (45405 > 35540).
Invalid NAL unit size (86720 > 54856).
Invalid NAL unit size (7131 > 3225).
missing picture in access unit with size 54860
Invalid NAL unit size (48042 > 33645).
missing picture in access unit with size 3229
missing picture in access unit with size 33649
Invalid NAL unit size (86720 > 54856).
Invalid NAL unit size (48042 > 33645).
Error splitting the input into NAL units.
missing picture in access unit with size 35544
Invalid NAL unit size (45405 > 35540).
Error splitting the input into NAL units.
Error splitting the input into NAL units.
Invalid NAL unit size (8187 > 7069).
missing picture in access unit with size 7073
Invalid NAL unit size (8187 > 7069).
Error splitting the input into NAL units.
Invalid NAL unit size (7131 > 3225).
Error splitting the input into NAL units.
Invalid NAL unit size (14013 > 5998).
missing picture in access unit with size 6002
Invalid NAL unit size (14013 > 5998).
Error splitting the input into NAL units.
Invalid NAL unit size (17173 > 7231).
missing picture in access unit with size 7235
Invalid NAL unit size (17173 > 7231).
Error splitting the input into NAL units.
Invalid NAL unit size (16964 > 6055).
missing picture in access unit with size 6059
Invalid NAL unit size (16964 > 6055).
Exception in thread Thread-9 (accepter)Error splitting the input into NAL units.
:
Traceback (most recent call last):
  File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner

Running tokenizer on dataset (num_proc=16):   0%|          | 0/19267 [13:22<?, ? examples/s]    self.run()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 953, in run
    Invalid NAL unit size (7032 > 2927).
missing picture in access unit with size 2931
self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/multiprocess/managers.py", line 194, in accepter
Invalid NAL unit size (7032 > 2927).
Error splitting the input into NAL units.
    t.start()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 935, in start
    Invalid NAL unit size (28973 > 6121).
missing picture in access unit with size 6125
_start_new_thread(self._bootstrap, ())Invalid NAL unit size (28973 > 6121).

RuntimeError: can't start new threadError splitting the input into NAL units.

Invalid NAL unit size (4411 > 296).
missing picture in access unit with size 300
Invalid NAL unit size (4411 > 296).
Error splitting the input into NAL units.
Invalid NAL unit size (14414 > 1471).
missing picture in access unit with size 1475
Invalid NAL unit size (14414 > 1471).
Error splitting the input into NAL units.
Invalid NAL unit size (5283 > 1792).
missing picture in access unit with size 1796
Invalid NAL unit size (5283 > 1792).
Error splitting the input into NAL units.
Invalid NAL unit size (79147 > 10042).
missing picture in access unit with size 10046
Invalid NAL unit size (79147 > 10042).
Error splitting the input into NAL units.

Others

No response

Steps to reproduce the bug

None

Expected behavior

I expect the run to complete successfully.

Environment info

transformers==4.49.0
datasets==3.2.0
accelerate==1.2.1
peft==0.12.0
trl==0.9.6
tokenizers==0.21.0
gradio>=4.38.0,<=5.18.0
pandas>=2.0.0
scipy
einops
sentencepiece
tiktoken
protobuf
uvicorn
pydantic
fastapi
sse-starlette
matplotlib>=3.7.0
fire
packaging
pyyaml
numpy<2.0.0
av
librosa
tyro<0.9.0
openlm-hub
qwen-vl-utils
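
The log above interleaves two different kinds of messages from the 16 worker processes: video decoder warnings ("Invalid NAL unit size", "Error splitting the input into NAL units", which look like H.264 decoder output, presumably reached through av) and a Python "RuntimeError: can't start new thread" raised in multiprocess/managers.py. A minimal sketch for separating the two, assuming the log has been saved as text; summarize_log is a hypothetical helper written for illustration and is not part of datasets or LLaMA-Factory:

```python
import re
from collections import Counter

# Matches decoder messages such as "Invalid NAL unit size (45405 > 35540)."
# even when they are glued onto another line by interleaved stderr output.
NAL_RE = re.compile(r"Invalid NAL unit size \((\d+) > (\d+)\)")


def summarize_log(text: str) -> dict:
    """Count distinct decoder errors and flag the thread-exhaustion failure."""
    pairs = Counter((int(a), int(b)) for a, b in NAL_RE.findall(text))
    return {
        "distinct_nal_errors": len(pairs),   # rough proxy for distinct bad packets
        "nal_error_counts": pairs,           # how often each (claimed, available) pair repeats
        "thread_exhaustion": "can't start new thread" in text,
    }


# A few lines copied from the log in this issue.
sample = """\
Invalid NAL unit size (45405 > 35540).
missing picture in access unit with size 35544
Invalid NAL unit size (45405 > 35540).
Error splitting the input into NAL units.
_start_new_thread(self._bootstrap, ())Invalid NAL unit size (28973 > 6121).
RuntimeError: can't start new threadError splitting the input into NAL units.
"""

summary = summarize_log(sample)
print(summary["distinct_nal_errors"])
print(summary["thread_exhaustion"])
```

If the number of distinct size pairs is small, that would suggest a handful of damaged video samples rather than a problem with every file. The thread error is a separate failure and may be what actually stops the run, but the log alone does not confirm this.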
