
Support multimodal models such as LLaVA for image input #1568

Open
cebtenzzre opened this issue Oct 24, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@cebtenzzre
Member

Feature request

We can make use of the upstream work at ggerganov/llama.cpp#3436 to support image input to LLMs.

@AndriyMulyar What was the name of the model that you wanted to consider as an alternative to LLaVA?

Motivation

Real-time image recognition on resource-constrained hardware would be very useful in applications such as robotics. This feature would open the door to broader use cases for GPT4All than simple text completion.

Your contribution

I may submit a pull request implementing this functionality.

@cebtenzzre cebtenzzre added the enhancement New feature or request label Oct 24, 2023
@AndriyMulyar
Contributor

Fuyu 8B is interesting because it's decoder-only.

I think LLaVA style is a fine choice, though, for an initial multimodal implementation.

@manyoso
Collaborator

manyoso commented Oct 24, 2023

This will require extensive changes to the GUI as well. It has been agreed that the GUI changes will come first, to provide a UI for the current multimodal upstream work.

@PedzacyKapec

+1
