
[Tracking] Multimodality Support #679

Closed
4 tasks
Kathryn-cat opened this issue Aug 7, 2023 · 8 comments

@Kathryn-cat
Contributor

Overview

Currently, we have multimodality support for MiniGPT4, but we have not yet finalized a high-level Python API for it, and we have not announced its CLI and iOS support. We also need to look into more models, such as LLaVA.

Action Items

  • Test and tune the MiniGPT4 module to ensure fast performance
  • Bring in a high-level Python API that supports multimodal generation
  • Add LLaVA model support
  • Finalize the usage documentation and make an announcement

Links to Related Issues and PRs

@Kathryn-cat Kathryn-cat added the status: tracking Tracking work in progress label Aug 7, 2023
@Kathryn-cat Kathryn-cat self-assigned this Aug 7, 2023
@JianbangZ

@Kathryn-cat Any update on LLaVA progress?

@Kathryn-cat
Contributor Author

Hi @JianbangZ, sorry for the delay. We recently went through a major refactoring of the Python/C++ and iOS codebases, so we hope to officially introduce it in the next week or two.

Side question: would you prefer running LLaVA on MLC in the Gradio frontend (a web page for uploading images) or in a phone environment (an iPhone app that lets you take a picture and ask questions)?

@JianbangZ


I think a Gradio demonstration with full Vulkan support would be great.

@Kathryn-cat
Contributor Author

Got it! We're working on it now and will release it soon.

@dusty-nv

dusty-nv commented Sep 16, 2023

+1 for llava-llama-2-chat support! Any updates or timeline for this @Kathryn-cat?

I see you have a dev branch here: https://github.com/Kathryn-cat/mlc-llm/tree/pr-llava-support

It seems the approach is to include the CLIP encoder, projection, and embeddings inside MLC Chat. Yes, it's nice for it to run out of the box like that; however, for flexibility, it would also be useful to be able to embed your own tokens into the prompt. LLaVA is just a Llama model with the image patch tokens embedded in the prompt. For example, I could run CLIP with TensorRT (although if MLC is fast enough with it, I can just use that).

EDIT: Would --sep-embed and prefill_with_embed() from #419 (comment) be the correct mlc_chat API for that?
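
For anyone following along, here is a rough sketch of that "bring your own embeddings" flow. It is only illustrative: the CLIP checkpoint, the stand-in projector, and the final prefill step are assumptions, not the actual mlc_chat API; only --sep-embed and prefill_with_embed() come from #419.

```python
# Illustrative only: encode an image into LLaVA-style patch embeddings outside
# of MLC, so they could be handed to the runtime via something like
# prefill_with_embed(). The projector below is a stand-in for LLaVA's trained
# multimodal projector, not a real checkpoint.
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
projector = torch.nn.Linear(1024, 4096)  # stand-in: CLIP hidden -> Llama hidden

def image_to_patch_embeddings(image):
    """Turn one PIL image into a (num_patches, llm_hidden) embedding tensor."""
    pixels = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        patches = vision(pixels).last_hidden_state[:, 1:]  # drop the CLS token
    return projector(patches).squeeze(0)

# These embeddings would then be spliced in place of the <image> placeholder,
# between the embeddings of the surrounding prompt tokens, before prefill.
```

Whether prefill_with_embed() accepts exactly this tensor layout is something the MLC folks would need to confirm.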

@acalatrava
Contributor

What's the status of this? I just saw the latest LLaVA version and it seems pretty cool! It would be great to have it in MLC-LLM.

@Smaran222

I want to use the 4-billion-parameter QwenVL model in a mobile app. Is there any update on multimodality support? I would love to be able to run it offline on a phone with MLC LLM. If not, would I have to use TVM directly?

@MasterJH5574
Collaborator

A late update: LLaVA has been supported since #1974, so I think we can conclude this issue for now.
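
For readers landing here later, a minimal usage sketch (mine, not from #1974): it assumes the model has been compiled for MLC, that `mlc_llm serve` is running on its default port, and that the OpenAI-compatible endpoint accepts OpenAI-style image_url content parts for vision models; the model id below is a placeholder, so check the PR and docs for the exact format.

```python
# Hypothetical usage sketch: query a LLaVA model through the OpenAI-compatible
# /v1/chat/completions endpoint exposed by `mlc_llm serve`. The model id and
# the image_url content part follow the OpenAI vision format and are
# assumptions; consult #1974 and the docs for what MLC LLM actually expects.
import requests

payload = {
    "model": "llava-1.5-7b-q4f16_1-MLC",  # placeholder model id
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
}
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```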
