
minicpm-v full-parameter finetune error #677

Closed
SeanLiaoy opened this issue Apr 9, 2024 · 4 comments
Assignees
Labels
bug Something isn't working solved

Comments

SeanLiaoy commented Apr 9, 2024

Describe the bug
The error appears after a few training steps. The same dataset trains qwen-vl-chat without issues. The training command is:

NPROC_PER_NODE=8 \
MASTER_PORT=29500 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --sft_type full \
    --model_type minicpm-v-3b-chat \
    --model_id_or_path ${BASE_MODEL_PATH} \
    --check_model_is_latest false \
    --dtype fp16 \
    --num_train_epochs 2 \
    --custom_train_dataset_path ${TRAIN_DATASET} \
    --custom_val_dataset_path ${VAL_DATASET} \
    --train_dataset_sample -1 \
    --max_length 400 \
    --learning_rate 1e-5 \
    --batch_size 8 \
    --gradient_accumulation_steps 8 \
    --output_dir ${SAVE_PATH} \
    --deepspeed 'default-zero2' \
    --eval_steps 100 \
    --save_steps 200 \
    --save_total_limit 2

Additional context

Time to load fused_adam op: 0.10169339179992676 seconds
{'loss': 4.241786, 'acc': 0.43225819, 'learning_rate': 0.0, 'epoch': 0.0, 'global_step': 1}
{'loss': 3.26653814, 'acc': 0.47446066, 'learning_rate': 4.42e-06, 'epoch': 0.01, 'global_step': 5}
{'loss': 2.87632065, 'acc': 0.48180504, 'learning_rate': 6.33e-06, 'epoch': 0.03, 'global_step': 10}
{'loss': 2.60134201, 'acc': 0.53103466, 'learning_rate': 7.44e-06, 'epoch': 0.04, 'global_step': 15}
Train:   2%|█████▌ | 19/760 [01:08<44:15,  3.58s/it]
Traceback (most recent call last):
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main
    result = llm_x(args, **kwargs)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/swift/llm/sft.py", line 236, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/accelerate/data_loader.py", line 462, in __iter__
    next_batch = next(dataloader_iter)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/swift/llm/utils/utils.py", line 200, in __getitem__
    res = self._try_fetch(idx)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/swift/llm/utils/utils.py", line 210, in _try_fetch
    res = self.template.encode(data)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/swift/llm/utils/template.py", line 1115, in encode
    pixel_values = self.model.transform(image)[None].to(
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 95, in __call__
    img = t(img)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 277, in forward
    return F.normalize(tensor, self.mean, self.std, self.inplace)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torchvision/transforms/functional.py", line 363, in normalize
    return F_t.normalize(tensor, mean=mean, std=std, inplace=inplace)
  File "/data/miniconda3/envs/env-3.10.6/lib/python3.10/site-packages/torchvision/transforms/_functional_tensor.py", line 928, in normalize
    return tensor.sub_(mean).div_(std)
RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 0
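The error comes from `transforms.Normalize`: the image decodes to 4 channels (e.g. RGBA) while the model's per-channel `mean`/`std` have 3 entries, so the broadcasted in-place subtraction fails. A minimal sketch of the mismatch (the 0.5 mean/std values are illustrative placeholders, not the model's actual normalization constants):

```python
import torch

# Normalize ultimately does tensor.sub_(mean).div_(std) with per-channel stats.
# A (3, 1, 1) mean broadcasts against a 3-channel image, but not a 4-channel one.
mean = torch.tensor([0.5, 0.5, 0.5]).view(-1, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(-1, 1, 1)

rgb = torch.rand(3, 32, 32)
print(rgb.sub(mean).div(std).shape)  # works: torch.Size([3, 32, 32])

rgba = torch.rand(4, 32, 32)  # image decoded with an alpha channel
try:
    rgba.sub(mean)
except RuntimeError as e:
    print(e)  # size of tensor a (4) must match the size of tensor b (3) ...
```

Converting each image with `Image.convert("RGB")` before the transform drops the extra channel and avoids the mismatch.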
SeanLiaoy changed the title from "minicpm-v full-parameter error" to "minicpm-v full-parameter finetune error" Apr 9, 2024
SeanLiaoy (Author) commented:

My guess is that the visual-encoding path is missing a dimension somewhere. Since I'm not familiar with the swift framework's training logic, I re-implemented full-parameter fine-tuning of minicpm-v on top of the huggingface trainer + deepspeed, and it runs successfully in my environment. See MR:
OpenBMB/MiniCPM-V#36

zhuliyi0 commented Apr 12, 2024

I hit a similar error fine-tuning cogvlm. It happened to occur during a fairly small validation loop, so I was able to track down the problem data. The attached file reproduces the error.

Settings:

CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_type cogvlm-17b-instruct \
    --sft_type lora \
    --tuner_backend swift \
    --dtype bf16 \
    --train_dataset_sample -1 \
    --max_steps 6000 \
    --max_length 2048 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules DEFAULT \
    --lora_dtype bf16 \
    --quantization_bit 0 \
    --bnb_4bit_comp_dtype AUTO \
    --gradient_checkpointing false \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-04 \
    --lr_scheduler_type linear \
    --gradient_accumulation_steps 1 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.1 \
    --eval_steps 200 \
    --save_steps 500 \
    --save_total_limit -1 \
    --logging_steps 25 \
    --add_output_dir_suffix false

Error:

Traceback (most recent call last):
File "/root/swift/swift/cli/sft.py", line 5, in <module>
sft_main()
File "/root/swift/swift/utils/run_utils.py", line 31, in x_main
result = llm_x(args, **kwargs)
File "/root/swift/swift/llm/sft.py", line 236, in llm_sft
trainer.train(training_args.resume_from_checkpoint)
File "/root/swift/swift/trainers/trainers.py", line 50, in train
res = super().train(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/root/swift/swift/trainers/mixin.py", line 590, in _maybe_log_save_evaluate
super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2412, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 166, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3229, in evaluate
output = eval_loop(
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3408, in evaluation_loop
for step, inputs in enumerate(dataloader):
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/data_loader.py", line 462, in __iter__
next_batch = next(dataloader_iter)
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/root/miniconda3/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/swift/swift/llm/utils/utils.py", line 200, in __getitem__
res = self._try_fetch(idx)
File "/root/swift/swift/llm/utils/utils.py", line 210, in _try_fetch
res = self.template.encode(data)
File "/root/swift/swift/llm/utils/template.py", line 1024, in encode
inputs2 = model.build_conversation_input_ids(
File "/root/.cache/huggingface/modules/transformers_modules/cogvlm-chat/modeling_cogvlm.py", line 818, in build_conversation_input_ids
images = [transform(images[0])]
File "/root/miniconda3/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 95, in __call__
img = t(img)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 277, in forward
return F.normalize(tensor, self.mean, self.std, self.inplace)
File "/root/miniconda3/lib/python3.10/site-packages/torchvision/transforms/functional.py", line 349, in normalize
return F_t.normalize(tensor, mean=mean, std=std, inplace=inplace)
File "/root/miniconda3/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py", line 926, in normalize
return tensor.sub_(mean).div_(std)
RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 0

Downloads.zip
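Since the comment above narrowed the failure down to specific samples, a small scan over the dataset can find every image that will trigger this error before training starts. A hedged sketch, assuming a jsonl dataset whose samples carry image paths under an `images` key (adjust the field name to your actual dataset format):

```python
import json
from PIL import Image

def find_non_rgb(jsonl_path):
    """Return paths of images in a jsonl dataset that don't decode to 3-channel RGB."""
    bad = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # The "images" field name is an assumption about the dataset layout.
            for path in sample.get("images", []):
                if Image.open(path).mode != "RGB":  # e.g. RGBA, P, L
                    bad.append(path)
    return bad
```

Any path it reports can be normalized in place with `Image.open(p).convert("RGB").save(p)`.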

@Jintao-Huang Jintao-Huang self-assigned this Apr 14, 2024
@Jintao-Huang Jintao-Huang added the bug Something isn't working label Apr 14, 2024
Jintao-Huang (Collaborator) commented:

Reproduced. I'll fix it.
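A fix for this class of error typically amounts to forcing a 3-channel decode before the torchvision transform runs. A minimal sketch of the idea (the helper name is hypothetical, and where exactly it gets called in the swift template code is an assumption):

```python
from PIL import Image

def load_rgb(path):
    # Force a 3-channel decode so Normalize's 3-entry mean/std broadcast cleanly,
    # whether the source file is RGBA, palette, or grayscale.
    return Image.open(path).convert("RGB")
```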

chuangzhidan commented:
So does full-parameter fine-tuning work correctly now?


4 participants