DeepSeek VL finetune vision encoder? #543
Comments
This model currently only supports fine-tuning of the LLM portion.
hi! is finetuning the aligner of deepseek-vl supported?
I am trying to provide support.
@Jintao-Huang Can you please provide pretraining code for the vision encoder, we need to give it new capabilities :)
CPT (continued pre-training):
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --custom_val_dataset_path yyy.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2
Wow! That was super quick, thank you so so much!!! ❤️
In what format must the custom train dataset be? (And what does the val dataset do exactly?)
Bro, you really rock!
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"]}
Is that the right format; would I now need to place
Format is similar to this:
The image will be automatically placed into the image_placeholder to form inputs_embeds. It supports multi-turn conversations with multiple images, but each turn of the conversation can only contain one image.
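For illustration only (placeholder strings and paths in the same style as the samples further down, not taken from the thread), one reading of that constraint is that a sample with one history turn plus the current query lists exactly two image paths, one per turn:
{"query": "CCCCC", "response": "DDDDD", "history": [["AAAAA", "BBBBB"]], "images": ["image_path1", "image_path2"]}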
@Jintao-Huang I don't quite understand these points:
Again thanks for your help!
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]} |
Every multimodal model has its own custom dataset style, and it is demonstrated in the best practices. For example, some multimodal models support zero or multiple images in a single conversation, such as qwen-vl. Some multimodal models only allow one image per turn of the conversation, like deepseek-vl. And there are multimodal models that require only one image for the entire dialogue, such as cogvlm.
The val_dataset is used to compute the eval_loss. You can also choose not to provide the custom_val_dataset_path and only pass the custom_train_dataset_path.
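As a minimal sketch of that option (reusing the placeholder paths from the CPT command above, flags otherwise unchanged; not a prescribed invocation), the same run with the validation file simply omitted would look like:
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2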
Thank you so so much!!! 🙏❤️
Hey @Jintao-Huang , thank you again for implementing the model, I really appreciate your work! Upon further testing, I realized that DeepSeek VL supports multi-image prompts.
This format seems to only support training on images + text. Can it be trained with plain-text data?
@Jintao-Huang Hello! Is it the case that deepseek-vl can currently only fine-tune the LLM part, and the aligner (connector) part cannot be trained? Thanks 🙏
Setting --lora_target_modules to ALL is enough. You can check the best practices, it's described there, including how to do full-parameter training.
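A hedged sketch of what a LoRA run along those lines could look like (the dataset path is a placeholder, and the exact flag set should be checked against the best practices):
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --sft_type lora \
    --lora_target_modules ALL \
    --custom_train_dataset_path xxx.jsonl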
Thank you so so much! 😊
@Jintao-Huang How much VRAM is needed to finetune the 7b VL model?

# Experimental Environment: A100
# GPU Memory Requirement: 80GB
# Runtime: 2.5 hours
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_type qwen1half-7b-chat \
    --dataset blossom-math-zh \
    --num_train_epochs 5 \
    --sft_type full \
    --output_dir output \
    --eval_steps 500 \

The docs say one needs 80GB for a normal 7b model; however, when I try to train on the research rig with an A100, I get an OOM. When trying to split across 4 GPUs (1 A100 and 3 4090s), it does not utilize the A100 and OOMs on the 3 4090s before training can start.
Hi! Does finetuning deepseek VL also finetune the vision encoder?