
Multimodal support #1216

Merged · 11 commits merged into main from the multimodal branch on Dec 11, 2023
Conversation

@pdevine (Contributor) commented on Nov 21, 2023:
This PR builds off of @mattapperson's work, but with a more ollama-like UX + API.

@pdevine (Contributor, author) commented on Nov 22, 2023:

Here's a demo of multi-modal in action:

multimodal3.mov (video attachment)

@pdevine (Contributor, author) commented on Nov 22, 2023:

To try it out, you can use the pdevine/llava-1.5:q4_k model. There's also a non-quantized model which you can pull at pdevine/llava-1.5. Details here

@igorschlum commented on Nov 22, 2023:

Hello @pdevine, this is great. Is there a way to support .txt or .csv files with the same syntax in Ollama?

```
(base) igor@macIgor ~ % ollama run llama2-uncensored
>>> describe this text: /Users/igor/song.txt
```

Does the ability to read a file live in LLaVA or in Ollama?

@pdevine (Contributor, author) commented on Nov 23, 2023:

@igorschlum That's a really great idea. Reading the image file happens in Ollama's REPL, and the contents are then passed to the model runner. It would have to work differently for embeddings or prompt stuffing. I'll need to think about how that could work.

@suavelizard commented on Nov 24, 2023:

Is there a way to test this via the REST API? Not sure how to get the image data through:

```
{
    "model": "pdevine/llava-1.5:13b",
    "prompt": "Describe this image {image_index}",
    "stream": false,
    "image_data": []  // Base64?
}
```

Edit: this worked, where `image_data` is an array of base64-encoded images (I only tested with one):

```
{
    "model": "pdevine/llava-1.5:13b",
    "prompt": "Describe this image",
    "stream": false,
    "image_data": ["<base64-encoded image>"]
}
```
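For anyone trying this from code, here is a minimal Go sketch of the request above. It assumes the `image_data` field name from this comment and Ollama's default local address; the merged API may use a different field name, so check the current API docs:

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Read an image and base64-encode it (the path is illustrative).
	raw, err := os.ReadFile("cat.png")
	if err != nil {
		panic(err)
	}

	payload := map[string]any{
		"model":      "pdevine/llava-1.5:13b",
		"prompt":     "Describe this image",
		"stream":     false,
		"image_data": []string{base64.StdEncoding.EncodeToString(raw)},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}

	// POST to the local Ollama server's generate endpoint.
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out["response"])
}
```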

@igorschlum commented:

Hi @pdevine, I still think it would be beneficial to provide a path to a JSON file for Ollama to analyze. When I attempt the `$(cat ...)` approach with a path to a JSON file containing quotes, I run into shell quoting trouble (see below):

```
$ ollama run llama2 --verbose "$(cat /Users/igor/fr.json" Please translate in spanish this json preserving the json format and do not change the name of each key, first item of each line of the json
dquote cmdsubst cmdsubst dquote>
```

@pdevine (Contributor, author) commented on Nov 27, 2023:

@igorschlum Definitely agree, but that can be a follow-up PR. There are lots of things to consider with that change, as I was alluding to before.

@igorschlum replied:

@pdevine I understand.

@jmorganca (Member) left a review:

Small nit-level comments, but this is looking great!

@pdevine force-pushed the multimodal branch 2 times, most recently from 9a2ac73 to 3090d08 (December 5, 2023)
@pdevine force-pushed the multimodal branch 2 times, most recently from 4e14f3c to 1c4fdb2 (December 5, 2023)
@pdevine marked this pull request as ready for review (December 5, 2023)
Review thread on the Modelfile parameter docs:

```diff
@@ -150,6 +150,7 @@ PARAMETER <parameter> <parametervalue>
| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
```

@jmorganca (Member): Can be in a follow-up PR; we should add a section in api.md for this 😃

Review thread on the image content-type check:

```go
}

contentType := http.DetectContentType(data)
allowedTypes := []string{"image/jpeg", "image/jpg", "image/svg+xml", "image/png"}
```

Contributor: Is this defined by the model? Can it recognize other image types, e.g. BMPs or GIFs?

@mxyng (Contributor) commented on Dec 6, 2023:

The list of possible values returned by DetectContentType is very short, but it contains some values not seen here.

@pdevine (Contributor, author) replied:

Ideally there would be metadata from the model telling us the list of supported mimetypes. I made the list a few weeks ago from the llava documentation, but we can explore changing this in the future.
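To illustrate the point above, here is a standalone sketch of what `http.DetectContentType` actually returns. It sniffs at most the first 512 bytes against a short, fixed signature table, and some entries in the allow-list (notably `image/svg+xml` and `image/jpg`) are never produced by it:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// PNG magic bytes are enough for the sniffer to identify the type.
	png := []byte("\x89PNG\r\n\x1a\n")
	fmt.Println(http.DetectContentType(png)) // image/png

	// SVG is sniffed as generic XML, not image/svg+xml.
	svg := []byte(`<?xml version="1.0"?><svg xmlns="http://www.w3.org/2000/svg"></svg>`)
	fmt.Println(http.DetectContentType(svg)) // text/xml; charset=utf-8

	// JPEG is always reported as image/jpeg, never image/jpg.
	jpg := []byte("\xff\xd8\xff\xe0")
	fmt.Println(http.DetectContentType(jpg)) // image/jpeg
}
```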

Review thread on cmd/cmd.go (outdated):
```go
func extractFileNames(input string) (string, []ImageData, error) {
	// Regex to match file paths starting with / or ./ and include escaped spaces (\ or %20)
	// and followed by more characters and a file extension
	regexPattern := `(?:\./|/)[\S\\ ]+?\.(?:jpg|jpeg|png|svg)\b`
```
Contributor: Relying on file extensions is not ideal.

@igorschlum: I have a question about file extensions. If the model is able to describe a jpg, would it be hard to accept a path to an mp3 for a model that could convert sound to text?

@pdevine (Contributor, author): Certainly not ideal, but there's not really a great way I can think of to do this well. The Mac drag-and-drop into a text window just inserts the file name.

@igorschlum: I think it would be nice to give the path to a file in the prompt and have Ollama read the data at that path and send it to the model to handle. You showed that in the video you provided two weeks ago. I was wondering why just pictures, and not mp3 or JSON?

Contributor: @igorschlum Both of those could be possible in the future, but different types of files rely on having support in the library we use to run the LLM.

It also relies on the models themselves being designed and tuned for those use cases. At the moment, image-to-text models are easier for us to support.
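As an aside, here is a self-contained sketch of how the extension-based pattern quoted above behaves, using the same regex from the diff (the surrounding code is hypothetical, not the PR's implementation):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The pattern from the diff: paths starting with / or ./, allowing
// escaped spaces, ending in a known image extension.
var imagePathRe = regexp.MustCompile(`(?:\./|/)[\S\\ ]+?\.(?:jpg|jpeg|png|svg)\b`)

func main() {
	input := `describe this image ./photos/cat\ 1.png please`

	for _, p := range imagePathRe.FindAllString(input, -1) {
		// Undo shell-style escaped spaces before opening the file.
		fmt.Println("path:", strings.ReplaceAll(p, `\ `, " "))
	}

	// The prompt with the matched paths stripped out.
	fmt.Println("prompt:", imagePathRe.ReplaceAllString(input, ""))
}
```

This also shows why extension matching is fragile: a path with no extension, or a .gif, would silently be left in the prompt.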

Review thread on cmd/cmd.go:

```go
	ctx = context.WithValue(ctx, generateContextKey("context"), []int{})
	cmd.SetContext(ctx)
}
if len(opts.Images) == 0 {
```
Contributor: I don't know if this is a good idea. At its core, a llava model is still a language model, and it's possible to interact with it as just a text-completion or chat model.

@jmorganca (Member) commented on Dec 6, 2023:

Was thinking this too! It works very well sans image, and modern LLMs seem to blend both together so the user can do either with a single model.

@igorschlum commented on Dec 7, 2023:

I thought it would be transparent to the LLM, since the text from the file at the path would be transmitted to the LLM. I see in the code that the image is transmitted in Base64, so that's why you don't want to use this syntax for XML, CSV, JSON, or txt files. For mp3, Base64 is also used.

I'm dreaming of being able to use Whisper with Ollama and provide an mp3 for conversion:
https://huggingface.co/openai/whisper-large-v3

For txt or JSON, I know I can type this in the terminal:

```
ollama run llama2 --verbose please translate in spanish "$(cat /Users/igor/song.txt)"
```

but I'm forced to go back to the terminal to prompt, and that's not simple for non-Unix users.

When I use this command inside Ollama, I get the wrong result:

```
ollama run llama2
>>> please translate in spanish "$(cat /Users/igor/song.txt)"
```

The model replies:

> The command $(cat /Users/igor/song.txt) is a Unix or Linux command that uses the cat command to display the contents of a file located at /Users/igor/song.txt.

It would be nice to be able to prompt:

```
>>> please translate in spanish this text /Users/igor/song.txt
```

and drag and drop the txt, JSON, or CSV file onto the terminal window. Usage would be much simpler, like in the video above for jpg.

Anyway, you're doing a super job, and I really enjoy Ollama.

@pdevine (Contributor, author) replied:

I'm extremely reluctant to take this out, because the llava model is pretty much useless until an image is added. I get that you can get it to answer a question, but that feels like a degenerate use case for the current model. The kicker for me is that there's no indication to the user that they can even add an image.

@igorschlum replied:

LLaVA is better than Llama 2 at understanding the context of text and images and using that information to answer questions or generate text.

If my sample.txt file is:

```
C'est l'histoire d'un petit chat qui chantait "miaou miaou" tout le temps.
And translate it also in Italian
```

and my prompt is:

```
ollama run llama2 --verbose please translate in spanish "$(cat /Users/igor/sample.txt)"
```

the answer will be:

```
Spanish: "Es la historia de un pequeño gato que cantaba 'miau miau' todo el tiempo."
Italian: "Questa è la storia di un gattino che cantava 'miao miao' sempre."
```

The text in sample.txt is sent to Ollama and interpreted as part of the prompt, not as a file to be processed. By separating the file contents from the prompt, LLaVA could focus on processing the text content independently, leading to more efficient and accurate responses. This would be particularly useful for tasks that require extensive text processing, such as translation or summarization.

Review thread on cmd/cmd.go:

```go
}
defer file.Close()

buf := make([]byte, 512)
```
Contributor: `bytes.Buffer` is probably more appropriate. You can do something like this, where the rest of the file is appended after the initial 512 bytes:

```go
var b bytes.Buffer

// Read just the first 512 bytes for content-type sniffing
// (io.EOF here only means the file is shorter than that).
if _, err := io.CopyN(&b, file, 512); err != nil {
  // return err
}

contentType := http.DetectContentType(b.Bytes())
if !slices.Contains(types, contentType) {
  // return err
}

// Append the remainder of the file to the same buffer.
if _, err := io.Copy(&b, file); err != nil {
  // return err
}
```

@pdevine (Contributor, author) replied:

We still have to stat the file and only read it if it's < 100MB. I feel like we're splitting hairs here.
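Putting those pieces together, here is a minimal sketch of the flow being discussed: stat first, enforce a size cap, sniff the first 512 bytes, then read the rest. The 100MB cap matches the comment above, but the helper name and allow-list are illustrative, not the PR's exact code (assumes the usual imports: bytes, fmt, io, net/http, os, slices):

```go
func readImageFile(path string) ([]byte, error) {
	// Stat before reading so oversized files can be refused cheaply.
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	if info.Size() > 100<<20 { // 100MB cap
		return nil, fmt.Errorf("%s is too large to use as an image", path)
	}

	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	// Sniff the content type from at most the first 512 bytes.
	var b bytes.Buffer
	if _, err := io.CopyN(&b, file, 512); err != nil && err != io.EOF {
		return nil, err
	}
	contentType := http.DetectContentType(b.Bytes())
	if !slices.Contains([]string{"image/jpeg", "image/png"}, contentType) {
		return nil, fmt.Errorf("unsupported content type %q", contentType)
	}

	// Append the rest of the file to the same buffer.
	if _, err := io.Copy(&b, file); err != nil {
		return nil, err
	}
	return b.Bytes(), nil
}
```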

@BruceMacD (Contributor) left a review:

ship

Review thread on the interactive prompt handling:

```diff
@@ -902,6 +934,26 @@ func generateInteractive(cmd *cobra.Command, opts generateOptions) error {

if len(prompt) > 0 && multiline == MultilineNone {
```

Reviewer: I think you could abstract these if conditions into a function so they're easier to test. Nice work, btw.

@pdevine merged commit 910e940 into main on Dec 11, 2023, and deleted the multimodal branch.
emsi pushed a commit to emsi/ollama that referenced this pull request on Dec 13, 2023.

Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>