DeepSeek VL finetune vision encoder? #543

Closed
SinanAkkoyun opened this issue Mar 12, 2024 · 22 comments
Assignees: Jintao-Huang
Labels: enhancement (New feature or request), good first issue (Good for newcomers), question (Further information is requested)

Comments

@SinanAkkoyun

Hi! Does finetuning deepseek VL also finetune the vision encoder?

@SinanAkkoyun changed the title from "DeepSeek VL finetune" to "DeepSeek VL finetune vision encoder?" on Mar 12, 2024
@Jintao-Huang
Collaborator

This model currently only supports fine-tuning of the LLM portion.

@Jintao-Huang self-assigned this on Mar 13, 2024
@Jintao-Huang added the enhancement (New feature or request) label on Mar 13, 2024
@wwzhuang01

Hi! Is fine-tuning the aligner of deepseek-vl supported?

@Jintao-Huang
Collaborator

I am working on adding support for this.

@SinanAkkoyun
Author

@Jintao-Huang Could you please provide pretraining code for the vision encoder? We need to give it new capabilities :)

@Jintao-Huang
Collaborator

CPT (continued pre-training):

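# Full-parameter continued pre-training of deepseek-vl-7b-chat on 4 GPUs with DeepSpeed ZeRO-2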
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --custom_val_dataset_path yyy.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2

@SinanAkkoyun
Author

Wow! That was super quick, thank you so so much!!! ❤️

@SinanAkkoyun
Author

In what format must the custom train dataset be? (And what does the val dataset do exactly?)

@soloice

soloice commented Mar 14, 2024

Bro, you really rock!

@SinanAkkoyun
Author

SinanAkkoyun commented Mar 14, 2024

{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"]}

Is that the right format? Would I now need to place <image_placeholder> where each image should go?
Also, is it possible to do multi-turn conversations with multiple images?

@Jintao-Huang
Collaborator

The format is similar to this:

[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.

User: <image_placeholder>please describe the image.

Assistant:[OUTPUT]A large airplane is suspended from the ceiling.<|end▁of▁sentence|>
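For reference, a rough sketch of a training sample that could render into the template above, assuming the custom-dataset JSONL format shown later in this thread (the image path is a placeholder):

{"query": "please describe the image.", "response": "A large airplane is suspended from the ceiling.", "images": ["path/to/airplane.jpeg"]}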

@Jintao-Huang
Collaborator

The image will be automatically inserted at the image_placeholder to form inputs_embeds. Multi-turn conversations with multiple images are supported, but each turn of the conversation can only contain one image.
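For illustration, a rough multi-turn sketch in the same JSONL style, assuming one image per turn and that the images list is matched to turns in order (placeholder text and paths):

{"query": "CCCCC", "response": "DDDDD", "history": [["AAAAA", "BBBBB"]], "images": ["turn1_image_path", "turn2_image_path"]}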

@SinanAkkoyun
Author

@Jintao-Huang
Thank you so much :)

I don't quite understand these points:

  • What exactly would the training jsonl look like?
  • How can I use the llava chat format, or any other format for that matter (they use that for the deepseek-vl chat models)?
  • What is the val dataset used for, and is it necessary?

Again thanks for your help!

@SinanAkkoyun reopened this on Mar 15, 2024
@Jintao-Huang added the question (Further information is requested) label on Mar 15, 2024
@Jintao-Huang
Collaborator

{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}

https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/deepseek-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

@Jintao-Huang
Collaborator

Every multimodal model has its own custom dataset format, and it is demonstrated in the corresponding best-practices doc. For example, some multimodal models, such as qwen-vl, support zero or multiple images in a single conversation. Some only allow one image per conversation, like deepseek-vl. And some require exactly one image for the entire dialogue, such as cogvlm.

@Jintao-Huang
Collaborator

The val_dataset is used to compute eval_loss. You can also choose not to provide custom_val_dataset_path and only pass custom_train_dataset_path.
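For example, a sketch of the earlier CPT command with the validation set left out, reusing only the flags that already appear above:

NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2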

@SinanAkkoyun
Author

Thank you so so much!!! 🙏❤️

@Jintao-Huang added the good first issue (Good for newcomers) label on Mar 17, 2024
@SinanAkkoyun
Author

SinanAkkoyun commented Apr 17, 2024

Hey @Jintao-Huang , thank you again for implementing the model, I really appreciate your work!

Upon further testing, I realized that DeepSeek VL supports multi-image prompts.
Could you please implement multi-image support for training? I'd love to train the model on some image comparisons.

@SinanAkkoyun reopened this on Apr 17, 2024
@daihuangyu

This format seems to only support training with images + text. Can the model be trained with plain-text data as well?

@CudaMem

CudaMem commented Apr 18, 2024

@Jintao-Huang Hello! Does deepseek-vl currently only support fine-tuning the LLM part, with no way to train the aligner (connector) part? Thanks 🙏

@Jintao-Huang
Collaborator

@Jintao-Huang Hello! Does deepseek-vl currently only support fine-tuning the LLM part, with no way to train the aligner (connector) part? Thanks 🙏

Setting --lora_target_modules to ALL is enough. You can check the best-practices doc; it is described there, including how to do full-parameter training.
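For illustration, a rough sketch of a LoRA run with --lora_target_modules ALL (which, per the reply above, covers the aligner), assuming --sft_type lora and reusing the flags from the commands earlier in this thread; xxx.jsonl is a placeholder:

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --sft_type lora \
    --lora_target_modules ALL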

@SinanAkkoyun
Author

Thank you so so much! 😊

@SinanAkkoyun
Author

@Jintao-Huang How much VRAM is needed to finetune the 7b VL model?

# Experimental Environment: A100
# GPU Memory Requirement: 80GB
# Runtime: 2.5 hours
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_type qwen1half-7b-chat \
    --dataset blossom-math-zh \
    --num_train_epochs 5 \
    --sft_type full \
    --output_dir output \
    --eval_steps 500

The docs say one needs 80 GB for a regular 7B model; however, when I try to train on the research rig with an A100, I get an OOM. When trying to split across 4 GPUs (1 A100 and 3 RTX 4090s), it does not utilize the A100 and OOMs on the 3 4090s before training can start.
