
Support multi-modal models #746

Closed
arian81 opened this issue Oct 10, 2023 · 21 comments
Labels
feature request New feature or request

Comments

@arian81

arian81 commented Oct 10, 2023

This is one of the best open source multi-modal models based on Llama 7B currently. It would be nice to be able to host it in ollama.
https://llava-vl.github.io/

@ryansereno

ryansereno commented Oct 10, 2023

Came here looking for this, to see if the discussion around it had begun.
Curious to see what will be required to make this happen.

Edit: Progress is being made upstream in llama.cpp to support this.

@jmorganca jmorganca added the feature request New feature or request label Oct 11, 2023
@spielhoelle

The PR @ryansereno mentioned is merged and in master now. How can we run this in ollama?

@marscod

marscod commented Oct 15, 2023

I could successfully run llava-v1.5-7b and it is available at: https://ollama.ai/marscod/llava but I have to map an image parameter to llama.cpp's image parameter. Maybe within the prompt?

@chigkim

chigkim commented Oct 16, 2023

It would be good to have a file reader command in the prompt, like /read file.jpg, for this.

@hugh-min

I could successfully run llava-v1.5-7b and it is available at: https://ollama.ai/marscod/llava but I have to map an image parameter to llama.cpp's image parameter. Maybe within the prompt?

Could you elaborate on how to map an image within ollama?

@jmorganca jmorganca changed the title Support llava multi modal model Support multi-modal models Oct 24, 2023
@Bortus-AI

I could successfully run llava-v1.5-7b and it is available at: https://ollama.ai/marscod/llava but I have to map an image parameter to llama.cpp's image parameter. Maybe within the prompt?

Could you elaborate on how to map an image within ollama?

I would like to know as well. Thanks

@tmc

tmc commented Oct 29, 2023

It seems a couple of interface design decisions are at play: 1) how to represent this in the HTTP API and 2) what the user/CLI interface should be.

I want to note/highlight that the folks hacking on iTerm2 have done some work that may be relevant in the CLI context here: https://iterm2.com/documentation-images.html

For the HTTP interface, I'd suggest taking some inspiration from how OpenAI is folding in image data. I did a bit of protocol decoding, and the TL;DR of how they do it is: upload to a blob store, then include a special message type in the completion message list.

There's also the consideration of whether it's an ollama concern to allow annotation of an incoming image to support highlighting part of the image. That feels a bit out of scope to start, but perhaps the design should keep that in mind.
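
For concreteness, a rough sketch of the message shape described above, based on OpenAI's public chat-completions vision format (the model name, file name, and the inline base64 data URL are illustrative; the blob-store upload path mentioned above is not shown):

import base64
import json

with open("photo.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

# A single user turn carries both a text part and an image part.
payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ],
}
print(json.dumps(payload)[:200])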

@sausheong

I could successfully run llava-v1.5-7b and it is available at: https://ollama.ai/marscod/llava but I have to map an image parameter to llama.cpp's image parameter. Maybe within the prompt?

Could you elaborate on how to map an image within ollama?

I would like to know as well. Thanks

Me too, can someone explain how to map an image within ollama?

@itsPreto

Love that this is marked as closed but everyone is still clueless over here lol

@orkutmuratyilmaz

@marscod thanks for importing the model. Could you add an example API call on the model page?

@mangiucugna

mangiucugna commented Nov 21, 2023

So I figured out how to use it; here's the code snippet:

with open("image.jpg", "rb") as f:
      encoded_string = base64.b64encode(f.read()).decode('utf-8')
  data = {"model": "marscod/llava", "prompt": f"USER: {encoded_string} {prompt}\nASSISTANT:", }
  try:
    response = requests.post(url="http://127.0.0.1:11434/api/generate", headers={"Content-Type": "application/json"}, json=data, stream=True)
  except Exception as e:
   # manage exception
  output = ""
  for chunk in response.text.split('\n'):
    chunk = json_repair.loads(chunk)
    if isinstance(chunk, dict):
      output += chunk.get("response") or ""

However, it also throws this error: {"error":"error reading llm response: bufio.Scanner: token too long"}

For reference, I prefer using llama.cpp directly with bakllava-1 (way more precise), and the syntax there looks like this:

with open("image.jpg", "rb") as f:
      encoded_string = base64.b64encode(f.read()).decode('utf-8')
  image_data = [{"data": encoded_string, "id": 42}]
  data = {"prompt": f"USER:[img-42] {prompt}.\nASSISTANT:", "n_predict": 4000, "image_data": image_data, "stream": True}
  try:
    response = requests.post(url="http://localhost:8080/completion", headers={"Content-Type": "application/json"}, json=data, stream=True)
  except Exception as e:
    # Manage exception
  output = ""
  for chunk in response.iter_content(chunk_size=128):
    content = chunk.decode().strip().split('\n\n')[0]
    try:
        content_split = content.split('data: ')
        if len(content_split) > 1:
            content_json = json_repair.loads(content_split[1])
            output += content_json["content"]
            yield output
    except Exception as e:
       # Manage exception

This is taken from: https://github.com/mangiucugna/local_multimodal_ai

Hope this helps!

@ryansereno

@mangiucugna thank you, will give it a try.
Hadn't heard of Bakllava before, very excited to try it.

@mangiucugna

I imported bakllava-1 locally and did some tests, and it performs so badly compared to the llama.cpp implementation that it is unusable.
I suspect that something is going wrong, that the data arriving at the model is corrupted, and that {"error":"error reading llm response: bufio.Scanner: token too long"} is somehow related.

Happy to share my Modelfile and a link to the gguf for anyone who wants to try to reproduce this.

@Kreijstal

llamafile (https://github.com/Mozilla-Ocho/llamafile) supports llava-1.5; it would be nice if ollama supported it too.

@mak448a

mak448a commented Dec 15, 2023

Now that this is added, I can't figure out how to upload an image to the model. When I follow the instructions at https://github.com/jmorganca/ollama/releases/tag/v0.1.15, it describes something completely different from what was in the picture. I'm on Linux.

@arian81
Author

arian81 commented Dec 15, 2023

Now that this is added, I can't figure out how to upload an image to the model. When I follow the instructions at https://github.com/jmorganca/ollama/releases/tag/v0.1.15, it describes something completely different from what was in the picture. I'm on Linux.

You probably haven't updated to the latest version of Ollama if you're getting a bunch of Chinese characters as the output.

@orkutmuratyilmaz

I guess that we can consider this issue as completed :)

@pdevine pdevine closed this as completed Dec 16, 2023
@prologic

When I try this I get:

$ ollama run llama2
>>> What's in this image? /Users/prologic/Downloads/IMG_1325.png

I cannot directly view or analyze the image you provided as it is a personal file located on a local computer. However, I can provide some general
information about images and how they can be analyzed.
...

And I'm using the latest version of ollama:

$ ollama --version
ollama version is 0.1.17

@pdevine
Contributor

pdevine commented Dec 26, 2023

@prologic llama2 isn't a multimodal model. You should try:

$ ollama run llava
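
If you want to hit the HTTP API directly instead, here is a minimal sketch along the lines of the v0.1.15 release notes linked above, assuming the `images` field of base64-encoded image data described there (the model name and file path are placeholders):

import base64
import json
import requests

with open("IMG_1325.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

# The image goes into the "images" list as base64, separate from the prompt text.
data = {"model": "llava", "prompt": "What's in this image?", "images": [img_b64]}
response = requests.post("http://127.0.0.1:11434/api/generate", json=data, stream=True)
for line in response.iter_lines():
    if line:
        print(json.loads(line).get("response", ""), end="", flush=True)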

@prologic

Ahh! Thanks. When I tried to search for multimodal models, the search turned up empty, which is why I wasn't able to figure this out so easily :/ There should be a way to list and search for multimodal models, even with ollama search (does this sub-command exist?)

@schuster-rainer

If you want to use it with LangChain, here is what you need to add to the HumanMessage:

HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{img_base64}",
        },
    ]
)
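
To round that out, a rough, self-contained sketch of sending such a message through ChatOllama from langchain_community, assuming that integration translates the image_url content part into ollama's image input (the model name and image path are placeholders):

import base64

from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

with open("image.jpg", "rb") as f:
    img_base64 = base64.b64encode(f.read()).decode("utf-8")

llm = ChatOllama(model="bakllava")  # placeholder model name
message = HumanMessage(
    content=[
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": f"data:image/jpeg;base64,{img_base64}"},
    ]
)
print(llm.invoke([message]).content)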
