DeepSeek VL finetune vision encoder? #543

Closed
SinanAkkoyun opened this issue Mar 12, 2024 · 22 comments
Assignees: Jintao-Huang
Labels: enhancement (New feature or request), good first issue (Good for newcomers), question (Further information is requested)

Comments

@SinanAkkoyun

Hi! Does finetuning deepseek VL also finetune the vision encoder?

@SinanAkkoyun changed the title from "DeepSeek VL finetune" to "DeepSeek VL finetune vision encoder?" on Mar 12, 2024
@Jintao-Huang
Collaborator

This model currently only supports fine-tuning of the LLM portion.

@Jintao-Huang self-assigned this on Mar 13, 2024
@Jintao-Huang added the enhancement (New feature or request) label on Mar 13, 2024
@wwzhuang01

Hi! Is fine-tuning the aligner of deepseek-vl supported?

@Jintao-Huang
Collaborator

I am working on adding support for this.

@SinanAkkoyun
Author

@Jintao-Huang Could you please provide pretraining code for the vision encoder? We need to give it new capabilities :)

@Jintao-Huang
Collaborator

CPT (continued pre-training):

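# Full-parameter continued pre-training of deepseek-vl-7b-chat on 4 GPUs with DeepSpeed ZeRO-2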
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --custom_val_dataset_path yyy.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2

@SinanAkkoyun
Author

Wow! That was super quick, thank you so so much!!! ❤️

@SinanAkkoyun
Author

In what format must the custom train dataset be? (And what does the val dataset do exactly?)

@soloice

soloice commented Mar 14, 2024

Bro, you really rock!

@SinanAkkoyun
Author

SinanAkkoyun commented Mar 14, 2024

{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"]}

Is that the right format? Would I now need to place <image_placeholder> where each image should go?
Also, is it possible to do multi-turn conversations with multiple images?

@Jintao-Huang
Collaborator

The format is similar to this:

[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.

User: <image_placeholder>please describe the image.

Assistant:[OUTPUT]A large airplane is suspended from the ceiling.<|end▁of▁sentence|>
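For reference, a rough sketch of a training sample that could render into the template above, assuming the custom-dataset JSONL format shown later in this thread (the image path is a placeholder):

{"query": "please describe the image.", "response": "A large airplane is suspended from the ceiling.", "images": ["path/to/airplane.jpeg"]}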

@Jintao-Huang
Collaborator

The image will be automatically inserted at the image_placeholder to form inputs_embeds. Multi-turn conversations with multiple images are supported, but each turn of the conversation can only contain one image.
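For illustration, a rough multi-turn sketch in the same JSONL style, assuming one image per turn and that the images list is matched to turns in order (placeholder text and paths):

{"query": "CCCCC", "response": "DDDDD", "history": [["AAAAA", "BBBBB"]], "images": ["turn1_image_path", "turn2_image_path"]}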

@SinanAkkoyun
Author

@Jintao-Huang
Thank you so much :)

I don't quite understand these points:

  • What exactly would the training jsonl look like?
  • How can I use the llava chat format, or any other format for that matter (they use that for the deepseek-vl chat models)?
  • What is the val dataset used for, and is it necessary?

Again thanks for your help!

@SinanAkkoyun reopened this on Mar 15, 2024
@Jintao-Huang added the question (Further information is requested) label on Mar 15, 2024
@Jintao-Huang
Collaborator

{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}

https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/deepseek-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

@Jintao-Huang
Collaborator

Every multimodal model has its own custom dataset format, and it is demonstrated in the corresponding best-practices doc. For example, some multimodal models, such as qwen-vl, support zero or multiple images in a single conversation. Some only allow one image per conversation, like deepseek-vl. And some require exactly one image for the entire dialogue, such as cogvlm.

@Jintao-Huang
Collaborator

The val_dataset is used to compute eval_loss. You can also choose not to provide custom_val_dataset_path and only pass custom_train_dataset_path.
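For example, a sketch of the earlier CPT command with the validation set left out, reusing only the flags that already appear above:

NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2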

@SinanAkkoyun
Author

Thank you so so much!!! 🙏❤️

@Jintao-Huang added the good first issue (Good for newcomers) label on Mar 17, 2024
@SinanAkkoyun
Author

SinanAkkoyun commented Apr 17, 2024

Hey @Jintao-Huang , thank you again for implementing the model, I really appreciate your work!

Upon further testing, I realized that DeepSeek VL supports multi-image prompts.
Could you please implement multi-image support for training? I'd love to train the model on some image comparisons.

@SinanAkkoyun reopened this on Apr 17, 2024
@daihuangyu

This format seems to only support training with images + text. Can the model be trained with plain-text data as well?

@CudaMem

CudaMem commented Apr 18, 2024

@Jintao-Huang Hello! Does deepseek-vl currently only support fine-tuning the LLM part, with no way to train the aligner (connector) part? Thanks 🙏

@Jintao-Huang
Collaborator

@Jintao-Huang Hello! Does deepseek-vl currently only support fine-tuning the LLM part, with no way to train the aligner (connector) part? Thanks 🙏

Setting --lora_target_modules to ALL is enough. You can check the best-practices doc; it is described there, including how to do full-parameter training.
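For illustration, a rough sketch of a LoRA run with --lora_target_modules ALL (which, per the reply above, covers the aligner), assuming --sft_type lora and reusing the flags from the commands earlier in this thread; xxx.jsonl is a placeholder:

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --sft_type lora \
    --lora_target_modules ALL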

@SinanAkkoyun
Author

Thank you so so much! 😊

@SinanAkkoyun
Author

@Jintao-Huang How much VRAM is needed to finetune the 7b VL model?

# Experimental Environment: A100
# GPU Memory Requirement: 80GB
# Runtime: 2.5 hours
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_type qwen1half-7b-chat \
    --dataset blossom-math-zh \
    --num_train_epochs 5 \
    --sft_type full \
    --output_dir output \
    --eval_steps 500

The docs say one needs 80 GB for a regular 7B model; however, when I try to train on the research rig with an A100, I get an OOM. When trying to split across 4 GPUs (1 A100 and 3 RTX 4090s), it does not utilize the A100 and OOMs on the 3 4090s before training can start.
