
Video DPO training error #8157

@zhanghang-official

Description


Reminder

  • I have read the above rules and searched the existing issues.

System Info

The training configuration is as follows:

### model
model_name_or_path: /raid/zhanghang02/weights/MiniCPM-V-2_6
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: lora
freeze_vision_tower: true
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]

### dataset
dataset: dpo_test_video
template: minicpm_v
cutoff_len: 256
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 1
dataloader_num_workers: 1

### output
output_dir: saves/minicpmv/lora/dpo
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 300.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
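
Since the error below is raised while collating this dataset, the way dpo_test_video is registered matters. It is assumed here to be a ShareGPT-style preference dataset with a videos column in data/dataset_info.json, roughly like the sketch below (the file name and column names are illustrative, not taken from this issue):

```json
"dpo_test_video": {
  "file_name": "dpo_test_video.json",
  "ranking": true,
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "chosen": "chosen",
    "rejected": "rejected",
    "videos": "videos"
  }
}
```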

The error is as follows:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.31s/it]
[INFO|modeling_utils.py:4888] 2025-05-26 16:46:08,068 >> All model checkpoint weights were used when initializing MiniCPMV.

[INFO|modeling_utils.py:4896] 2025-05-26 16:46:08,069 >> All the weights of MiniCPMV were initialized from the model checkpoint at /raid/zhanghang02/weights/MiniCPM-V-2_6.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MiniCPMV for predictions without further training.
[INFO|configuration_utils.py:1093] 2025-05-26 16:46:08,156 >> loading configuration file /raid/zhanghang02/weights/MiniCPM-V-2_6/generation_config.json
[INFO|configuration_utils.py:1140] 2025-05-26 16:46:08,156 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}

[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.misc:143 >> Found linear modules: q_proj,v_proj,up_proj,k_proj,o_proj,down_proj,gate_proj
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set vision model not trainable: ['vpm'].
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set multi model projector not trainable: resampler.
[INFO|2025-05-26 16:46:08] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 8,119,360,240 || trainable%: 0.2486
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:741] 2025-05-26 16:46:09,007 >> Using auto half precision backend
[INFO|trainer.py:2369] 2025-05-26 16:46:09,246 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-05-26 16:46:09,246 >> Num examples = 109
[INFO|trainer.py:2371] 2025-05-26 16:46:09,246 >> Num Epochs = 300
[INFO|trainer.py:2372] 2025-05-26 16:46:09,246 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2375] 2025-05-26 16:46:09,246 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:2376] 2025-05-26 16:46:09,246 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2377] 2025-05-26 16:46:09,246 >> Total optimization steps = 32,700
[INFO|trainer.py:2378] 2025-05-26 16:46:09,250 >> Number of trainable parameters = 20,185,088
0%| | 0/32700 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/cli.py", line 115, in main
COMMAND_MAP[command]()
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 78, in _training_function
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 80, in run_dpo
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 133, in get_batch_samples
return Trainer.get_batch_samples(self, *args, **kwargs)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 5153, in get_batch_samples
batch_samples += [next(epoch_iterator)]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/accelerate/data_loader.py", line 566, in iter
current_batch = next(dataloader_iter)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733, in next
data = self._next_data()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
return self._process_data(data, worker_id)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
data.reraise()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/_utils.py", line 750, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 264, in call
return super().call(concatenated_features)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 157, in call
mm_inputs = self.template.mm_plugin.get_mm_inputs(
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/mm_plugin.py", line 1080, in get_mm_inputs
image_bounds = torch.hstack(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 3 but got size 2 for tensor number 1 in the list.

0%| | 0/32700 [00:00<?, ?it/s]
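
For context, the failing call is the torch.hstack in the MiniCPM-V mm_plugin that pairs the positions of image/video placeholder start tokens with their end tokens to build image_bounds. The minimal sketch below reproduces the same failure mode; the index values are made up, and the idea that truncation (cutoff_len: 256 is very small for video inputs) drops some end markers is a guess, not something confirmed by the log:

```python
import torch

# Hypothetical token positions of image/slice start and end markers in one sample.
# If the sequence is cut between a start marker and its matching end marker, the
# two index tensors end up with different lengths.
starts = torch.tensor([[5], [40], [75]])  # three start markers found
ends = torch.tensor([[30], [65]])         # only two end markers found

# torch.hstack concatenates 2-D tensors along dim 1 and requires dim 0 to match,
# so this raises: RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 3 but got size 2 for tensor number 1 in the list.
image_bounds = torch.hstack([starts, ends])
```

If that reading is correct, raising cutoff_len well above 256 (video frames expand into many placeholder tokens) would be the first thing to try.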

Reproduction

Put your message here.

Others

No response

Labels: bug (Something isn't working), pending (This problem is yet to be addressed)
