
Generalize multimodality #1741

Merged: 18 commits into oobabooga:main on May 9, 2023

Conversation

@Wojtab (Contributor) commented on May 2, 2023

This is a POC for #1687. I based it on this PR: #1664, so it has a few extra commits (I left the llava extension here for reference, but it's going to be deprecated by multimodality).

Short description

The main gist of it is to provide a unified framework for multimodal pipelines - those which still use a plain LLM with only modified weights/biases/added tokens, not LLMs with added layers for multimodality. This is achieved by adding a new extension, called multimodality.
The hooks to text-generation-webui are the same as in llava, but I no longer use custom_generate_chat_prompt; instead, I provide a new extension hook, tokenized_length, to be used instead of len(encode(prompt)), as tokenizer extensions can modify the number of tokens.
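As a minimal sketch of how an extension might implement that hook (the <image> placeholder and the IMAGE_EMBED_TOKENS constant are illustrative assumptions, not part of the actual API):

```python
# Hypothetical sketch of the tokenized_length hook in an extension's
# script.py; the webui calls it in place of len(encode(prompt)).
from modules import shared

IMAGE_EMBED_TOKENS = 256  # hypothetical per-image token budget


def tokenized_length(prompt: str) -> int:
    # Count the text tokens, then add the embedding slots this
    # extension will later inject for each image placeholder.
    num_images = prompt.count('<image>')  # hypothetical placeholder
    text_tokens = len(shared.tokenizer.encode(prompt))
    return text_tokens + num_images * IMAGE_EMBED_TOKENS
```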

Working principle

The multimodality extension does most of the work required for any image input:

  • adds the UI
  • saves the images as base64 JPEGs to history
  • provides the hooks to the UI
  • then, if there are images in the prompt, it:
    • splits the prompt into text and image parts
    • adds image start/end markers to the text parts, then encodes and embeds them
    • calls the vision pipeline to embed the images
    • stitches the embeddings together and returns them to text generation (see the sketch after this list)
  • loads the appropriate vision pipeline, selected either from the model name or via the --multimodal-pipeline parameter
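A minimal sketch of that split-and-stitch flow, under assumptions: the <img src="data:image/jpeg;base64,..."> placeholder format, the embed_text callable, and the pipeline.embed_images method are illustrative, and start/end marker handling is omitted for brevity:

```python
import base64
import io
import re

import torch
from PIL import Image


def prompt_to_embeddings(prompt: str, pipeline, embed_text) -> torch.Tensor:
    # Split the prompt into alternating text and base64-image parts.
    parts = re.split(r'<img src="data:image/jpeg;base64,([^"]+)">', prompt)
    chunks = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Text part: encode and embed with the LLM's embedding layer.
            chunks.append(embed_text(part))
        else:
            # Image part: decode the base64 JPEG and let the vision
            # pipeline turn it into embeddings in the LLM's space.
            image = Image.open(io.BytesIO(base64.b64decode(part)))
            chunks.append(pipeline.embed_images([image]))
    # Stitch everything into one embedding sequence for text generation.
    return torch.cat(chunks, dim=0)
```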

Now, for the pipelines, they:

  • load the required vision models
  • return some consts, for example the number of tokens taken up by an image
  • and most importantly: return the embeddings, given a list of images

Pipelines

All of the pipelines should subclass the AbstractMultimodalPipeline class. The idea is to allow new pipelines to be added in the same way as user extensions: by git cloning them into extensions/multimodal/pipelines.
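As a rough illustration of that interface, here is a sketch under assumptions: only a few representative methods are shown, and a real pipeline would subclass AbstractMultimodalPipeline and implement its full interface:

```python
from typing import List

import torch
from PIL import Image


# In the real extension this would subclass AbstractMultimodalPipeline;
# the class name and the attributes used below are hypothetical.
class MyPipeline:
    @staticmethod
    def name() -> str:
        return 'my-pipeline-13b'

    @staticmethod
    def num_image_embeds() -> int:
        # One of the "consts" mentioned above: how many token
        # positions a single image occupies.
        return 256

    def embed_images(self, images: List[Image.Image]) -> torch.Tensor:
        # self.vision_tower and self.projection stand in for a vision
        # model and a projection layer loaded by the constructor.
        features = self.vision_tower(images)
        return self.projection(features)
```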
For the POC I'm providing two built-in pipelines:

  • llava-13b - for LLaVA v0 13B, for example wojtab/llava-13b-v0-4bit-128g
  • llava-7b - for LLaVA v0 7B, for example wojtab/llava-7b-v0-4bit-128g

And one pipeline outside the repository:

  • minigpt4-13b - for MiniGPT-4 13B. To run it:
    • clone https://github.com/Wojtab/minigpt-4-pipeline into extensions/multimodal/pipelines
    • install its requirements.txt
    • use it with Vicuna-v0-13B, for example anon8231489123/vicuna-13b-GPTQ-4bit-128g
    • note: it was hacked together quickly as a showcase of this flow; it worked for me, but it might break

Pipeline modules

All of the pipeline modules should have a pipelines.py file, which exposes the following fields:

  • available_pipelines: List[str] - the list of pipelines provided by this module, shown to the user
  • def get_pipeline(name: str, params: dict) -> Optional[AbstractMultimodalPipeline]: - a function returning a concrete pipeline by name; if the name doesn't match any, it should return None. params holds the user settings for the multimodal extension
  • def get_pipeline_from_model_name(model_name: str, params: dict) -> Optional[AbstractMultimodalPipeline]: - a function returning a pipeline based on model_name. It should be eager to return None unless the determination can be made clearly (for example: MiniGPT-4 is based on Vicuna, so its module should never return a pipeline from the model name alone, but LLaVA's can, as LLaVA has its own specific LLM finetune)

A pipeline module should lazy-import its pipelines only when necessary, and it should keep its imports to a minimum.
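A minimal sketch of such a pipelines.py, assuming a hypothetical my_pipeline module sitting next to it; note that the pipeline class is imported lazily, only once a pipeline is actually requested:

```python
available_pipelines = ['my-pipeline-13b']


def get_pipeline(name: str, params: dict):
    if name == 'my-pipeline-13b':
        from .my_pipeline import MyPipeline  # lazy import
        return MyPipeline(params)
    return None


def get_pipeline_from_model_name(model_name: str, params: dict):
    # Be eager to return None: claim a model only when the name
    # is unambiguous (cf. the MiniGPT-4 vs. LLaVA example above).
    if 'my-model' not in model_name.lower():
        return None
    from .my_pipeline import MyPipeline  # lazy import
    return MyPipeline(params)
```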

Example

[Screenshots: LLaVA-13B, LLaVA-7B, MiniGPT-4 13B]

Looking for feedback

@Wojtab marked this pull request as draft on May 2, 2023 16:37
@Wojtab (Contributor, Author) commented on May 4, 2023

still working on minigpt4/docs, but:

good news: it looks like LLMs can be switched in some pipelines, so it's going to be fun mixing and matching
bad news: LLMs can be switched, and MiniGPT-4 7B works with Pygmalion 7B, so yeah...
(python3 server.py --model pygmalion-7b --chat --extensions multimodal --multimodal-pipeline minigpt4-7b)

@james-s-tayler commented:

This is miracle-tier.

@oobabooga (Owner) commented:

This is looking good so far.

@Wojtab marked this pull request as ready for review on May 5, 2023 19:11
@Wojtab (Contributor, Author) commented on May 5, 2023

I think it's ready now; sorry for the PR size.

@Wojtab changed the title from "[POC] Generalize multimodality" to "Generalize multimodality" on May 5, 2023
@oobabooga (Owner) commented:

Sorry for taking so long to review this. This is one of those foundational PRs that are of major importance but don't receive immediate attention from users. Probably because of your modest title ;)

I have tested all pipelines and everything just worked:

python server.py --model wojtab_llava-7b-v0-4bit-128g --multimodal-pipeline llava-7b --chat 
python3 server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --chat 
python server.py --model anon8231489123_vicuna-13b-GPTQ-4bit-128g --multimodal-pipeline minigpt4-13b --chat
python server.py --model llama-7b-4bit --multimodal-pipeline minigpt4-7b --chat

I have made some minor changes:

  1. Gave specific examples of commands in the multimodal README, so that people can copy and paste
  2. Made the multimodal extension be loaded automatically if --multimodal-pipeline is supplied
  3. Added a mention in the main README

Thanks a lot for yet another brilliant PR!

@oobabooga merged commit e9e75a9 into oobabooga:main on May 9, 2023