
Generalize multimodality #1741

Merged: 18 commits into oobabooga:main on May 9, 2023

Conversation

@Wojtab (Contributor) commented on May 2, 2023

This is a POC for #1687. I based it on this PR: #1664, so it has a few extra commits (I left the llava extension here for reference, but it's going to be deprecated by multimodality).

Short description

The main gist of it is to provide a unified framework for multimodal pipelines - those which still use a plain LLM with only modified weights/biases/added tokens, not LLMs with added layers for multimodality. This is achieved by adding a new extension, called multimodality.
The hooks to text-generation-webui are the same as in llava, but I no longer use custom_generate_chat_prompt; instead, I provide a new extension hook, tokenized_length, to be used instead of len(encode(prompt)), as tokenizer extensions can modify the number of tokens.
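As a minimal sketch of how an extension might implement that hook (the <image> placeholder and the IMAGE_EMBED_TOKENS constant are illustrative assumptions, not part of the actual API):

```python
# Hypothetical sketch of the tokenized_length hook in an extension's
# script.py; the webui calls it in place of len(encode(prompt)).
from modules import shared

IMAGE_EMBED_TOKENS = 256  # hypothetical per-image token budget


def tokenized_length(prompt: str) -> int:
    # Count the text tokens, then add the embedding slots this
    # extension will later inject for each image placeholder.
    num_images = prompt.count('<image>')  # hypothetical placeholder
    text_tokens = len(shared.tokenizer.encode(prompt))
    return text_tokens + num_images * IMAGE_EMBED_TOKENS
```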

Working principle

The multimodality extension does most of the work required for any image input:

  • adds the UI
  • saves the images as base64 JPEGs to history
  • provides the hooks to the UI
  • then, if there are images in the prompt, it:
    • splits the prompt into text and image parts
    • adds image start/end markers to the text parts, then encodes and embeds them
    • calls the vision pipeline to embed the images
    • stitches the embeddings together and returns them to text generation (see the sketch after this list)
  • loads the appropriate vision pipeline, selected either from the model name or via the --multimodal-pipeline parameter
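A minimal sketch of that split-and-stitch flow, under assumptions: the <img src="data:image/jpeg;base64,..."> placeholder format, the embed_text callable, and the pipeline.embed_images method are illustrative, and start/end marker handling is omitted for brevity:

```python
import base64
import io
import re

import torch
from PIL import Image


def prompt_to_embeddings(prompt: str, pipeline, embed_text) -> torch.Tensor:
    # Split the prompt into alternating text and base64-image parts.
    parts = re.split(r'<img src="data:image/jpeg;base64,([^"]+)">', prompt)
    chunks = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Text part: encode and embed with the LLM's embedding layer.
            chunks.append(embed_text(part))
        else:
            # Image part: decode the base64 JPEG and let the vision
            # pipeline turn it into embeddings in the LLM's space.
            image = Image.open(io.BytesIO(base64.b64decode(part)))
            chunks.append(pipeline.embed_images([image]))
    # Stitch everything into one embedding sequence for text generation.
    return torch.cat(chunks, dim=0)
```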

Now, for the pipelines, they:

  • load the required vision models
  • return some consts, for example the number of tokens taken up by an image
  • and most importantly: return the embeddings, given a list of images

Pipelines

All of the pipelines should subclass the AbstractMultimodalPipeline class. The idea is to allow new pipelines to be added in the same way as user extensions: by git cloning them into extensions/multimodal/pipelines.
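As a rough illustration of that interface, here is a sketch under assumptions: only a few representative methods are shown, and a real pipeline would subclass AbstractMultimodalPipeline and implement its full interface:

```python
from typing import List

import torch
from PIL import Image


# In the real extension this would subclass AbstractMultimodalPipeline;
# the class name and the attributes used below are hypothetical.
class MyPipeline:
    @staticmethod
    def name() -> str:
        return 'my-pipeline-13b'

    @staticmethod
    def num_image_embeds() -> int:
        # One of the "consts" mentioned above: how many token
        # positions a single image occupies.
        return 256

    def embed_images(self, images: List[Image.Image]) -> torch.Tensor:
        # self.vision_tower and self.projection stand in for a vision
        # model and a projection layer loaded by the constructor.
        features = self.vision_tower(images)
        return self.projection(features)
```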
For the POC I'm providing two built-in pipelines:

  • llava-13b - for LLaVA v0 13B, for example wojtab/llava-13b-v0-4bit-128g
  • llava-7b - for LLaVA v0 7B, for example wojtab/llava-7b-v0-4bit-128g

And one pipeline outside the repository:

  • minigpt4-13b - for MiniGPT-4 13B. To run it:
    • clone https://github.com/Wojtab/minigpt-4-pipeline into extensions/multimodal/pipelines
    • install its requirements.txt
    • use it with Vicuna-v0-13B, for example anon8231489123/vicuna-13b-GPTQ-4bit-128g
    • note: it was hacked together quickly as a showcase of this flow; it worked for me, but it might break

Pipeline modules

All of the pipeline modules should have a pipelines.py file, which exposes the following fields:

  • available_pipelines: List[str] - the list of pipelines provided by this module, shown to the user
  • def get_pipeline(name: str, params: dict) -> Optional[AbstractMultimodalPipeline]: - a function returning a concrete pipeline by name; if the name doesn't match any, it should return None. params holds the user settings for the multimodal extension
  • def get_pipeline_from_model_name(model_name: str, params: dict) -> Optional[AbstractMultimodalPipeline]: - a function returning a pipeline based on model_name. It should be eager to return None unless the determination can be made clearly (for example: MiniGPT-4 is based on Vicuna, so its module should never return a pipeline from the model name alone, but LLaVA's can, as LLaVA has its own specific LLM finetune)

A pipeline module should lazy-import its pipelines only when necessary, and it should keep its imports to a minimum.
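A minimal sketch of such a pipelines.py, assuming a hypothetical my_pipeline module sitting next to it; note that the pipeline class is imported lazily, only once a pipeline is actually requested:

```python
available_pipelines = ['my-pipeline-13b']


def get_pipeline(name: str, params: dict):
    if name == 'my-pipeline-13b':
        from .my_pipeline import MyPipeline  # lazy import
        return MyPipeline(params)
    return None


def get_pipeline_from_model_name(model_name: str, params: dict):
    # Be eager to return None: claim a model only when the name
    # is unambiguous (cf. the MiniGPT-4 vs. LLaVA example above).
    if 'my-model' not in model_name.lower():
        return None
    from .my_pipeline import MyPipeline  # lazy import
    return MyPipeline(params)
```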

Example

[Screenshots: LLaVA-13B, LLaVA-7B, MiniGPT-4 13B]

Looking for feedback

@Wojtab marked this pull request as draft on May 2, 2023 16:37
@Wojtab (Contributor, Author) commented on May 4, 2023

still working on minigpt4/docs, but:

good news: it looks like LLMs can be switched in some pipelines, so it's going to be fun mixing and matching
bad news: LLMs can be switched, and MiniGPT-4 7B works with Pygmalion 7B, so yeah...
(python3 server.py --model pygmalion-7b --chat --extensions multimodal --multimodal-pipeline minigpt4-7b)

@james-s-tayler commented:

This is miracle-tier.

@oobabooga (Owner) commented:

This is looking good so far.

@Wojtab marked this pull request as ready for review on May 5, 2023 19:11
@Wojtab (Contributor, Author) commented on May 5, 2023

I think it's ready now; sorry for the PR size.

@Wojtab changed the title from "[POC] Generalize multimodality" to "Generalize multimodality" on May 5, 2023
@oobabooga (Owner) commented:

Sorry for taking so long to review this. This is one of those foundational PRs that are of major importance but don't receive immediate attention from users. Probably because of your modest title ;)

I have tested all pipelines and everything just worked:

python server.py --model wojtab_llava-7b-v0-4bit-128g --multimodal-pipeline llava-7b --chat 
python3 server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --chat 
python server.py --model anon8231489123_vicuna-13b-GPTQ-4bit-128g --multimodal-pipeline minigpt4-13b --chat
python server.py --model llama-7b-4bit --multimodal-pipeline minigpt4-7b --chat

I have made some minor changes:

  1. Gave specific examples of commands in the multimodal README, so that people can copy and paste
  2. Made the multimodal extension be loaded automatically if --multimodal-pipeline is supplied
  3. Added a mention in the main README

Thanks a lot for yet another brilliant PR!

@oobabooga merged commit e9e75a9 into oobabooga:main on May 9, 2023