Description
Reminder
- I have read the above rules and searched the existing issues.
System Info
The training configuration is as follows:
### model
model_name_or_path: /raid/zhanghang02/weights/MiniCPM-V-2_6
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: lora
freeze_vision_tower: true
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]

### dataset
dataset: dpo_test_video
template: minicpm_v
cutoff_len: 256
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 1
dataloader_num_workers: 1

### output
output_dir: saves/minicpmv/lora/dpo
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 300.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
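For reference, here is a hedged sketch of what a single record in a video preference dataset such as `dpo_test_video` might look like, written as a Python dict in LLaMA-Factory's sharegpt-style preference layout; the question, the two answers, and the path `data/videos/sample.mp4` are made-up placeholders, not taken from the actual dataset.

```python
import json

# Hedged sketch of one DPO record with a video attachment, assuming the
# sharegpt-style preference format (conversations / chosen / rejected / videos).
record = {
    "conversations": [
        {"from": "human", "value": "<video>What happens in this clip?"}
    ],
    "chosen": {"from": "gpt", "value": "A person crosses the street and waves."},
    "rejected": {"from": "gpt", "value": "The clip shows a static photo of a cat."},
    "videos": ["data/videos/sample.mp4"],  # hypothetical path
}

# Both sides of the preference pair share the same prompt and the same single
# <video> placeholder, so the chosen and rejected sequences should carry a
# consistent set of multimodal tokens into the collator.
print(json.dumps([record], ensure_ascii=False, indent=2))
```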
The error is as follows:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.31s/it]
[INFO|modeling_utils.py:4888] 2025-05-26 16:46:08,068 >> All model checkpoint weights were used when initializing MiniCPMV.
[INFO|modeling_utils.py:4896] 2025-05-26 16:46:08,069 >> All the weights of MiniCPMV were initialized from the model checkpoint at /raid/zhanghang02/weights/MiniCPM-V-2_6.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MiniCPMV for predictions without further training.
[INFO|configuration_utils.py:1093] 2025-05-26 16:46:08,156 >> loading configuration file /raid/zhanghang02/weights/MiniCPM-V-2_6/generation_config.json
[INFO|configuration_utils.py:1140] 2025-05-26 16:46:08,156 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.misc:143 >> Found linear modules: q_proj,v_proj,up_proj,k_proj,o_proj,down_proj,gate_proj
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set vision model not trainable: ['vpm'].
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set multi model projector not trainable: resampler.
[INFO|2025-05-26 16:46:08] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 8,119,360,240 || trainable%: 0.2486
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:741] 2025-05-26 16:46:09,007 >> Using auto half precision backend
[INFO|trainer.py:2369] 2025-05-26 16:46:09,246 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-05-26 16:46:09,246 >> Num examples = 109
[INFO|trainer.py:2371] 2025-05-26 16:46:09,246 >> Num Epochs = 300
[INFO|trainer.py:2372] 2025-05-26 16:46:09,246 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2375] 2025-05-26 16:46:09,246 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:2376] 2025-05-26 16:46:09,246 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2377] 2025-05-26 16:46:09,246 >> Total optimization steps = 32,700
[INFO|trainer.py:2378] 2025-05-26 16:46:09,250 >> Number of trainable parameters = 20,185,088
0%| | 0/32700 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/cli.py", line 115, in main
COMMAND_MAP[command]()
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 78, in _training_function
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 80, in run_dpo
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 133, in get_batch_samples
return Trainer.get_batch_samples(self, *args, **kwargs)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 5153, in get_batch_samples
batch_samples += [next(epoch_iterator)]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/accelerate/data_loader.py", line 566, in iter
current_batch = next(dataloader_iter)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733, in next
data = self._next_data()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
return self._process_data(data, worker_id)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
data.reraise()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/_utils.py", line 750, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 264, in call
return super().call(concatenated_features)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 157, in call
mm_inputs = self.template.mm_plugin.get_mm_inputs(
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/mm_plugin.py", line 1080, in get_mm_inputs
image_bounds = torch.hstack(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 3 but got size 2 for tensor number 1 in the list.
0%| | 0/32700 [00:00<?, ?it/s]
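On the final RuntimeError: `torch.hstack` concatenates 2-D tensors along dimension 1, so every tensor in the list must have the same number of rows; the message says the tensors being stacked inside `get_mm_inputs` have 3 rows and 2 rows respectively. A minimal, self-contained PyTorch sketch (unrelated to the LLaMA-Factory code path above) that raises the same message:

```python
import torch

# torch.hstack on 2-D tensors is a concat along dim 1, so all tensors must
# agree in dim 0. Stacking a 3-row tensor with a 2-row tensor reproduces the
# exact "Expected size 3 but got size 2 for tensor number 1" error.
a = torch.zeros(3, 1, dtype=torch.long)  # e.g. 3 bound entries
b = torch.zeros(2, 1, dtype=torch.long)  # e.g. only 2 bound entries

try:
    torch.hstack([a, b])
except RuntimeError as err:
    print(err)  # Sizes of tensors must match except in dimension 1 ...
```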
Reproduction
Put your message here.
Others
No response