
Multimodal support #1216

Merged · 11 commits merged into main from the multimodal branch on Dec 11, 2023
Conversation

@pdevine (Contributor) commented on Nov 21, 2023:
This PR builds off of @mattapperson's work, but with a more ollama-like UX + API.

@pdevine (Contributor, author) commented on Nov 22, 2023:

Here's a demo of multi-modal in action:

multimodal3.mov (video attachment)

@pdevine (Contributor, author) commented on Nov 22, 2023:

To try it out, you can use the pdevine/llava-1.5:q4_k model. There's also a non-quantized model which you can pull at pdevine/llava-1.5. Details here

@igorschlum commented on Nov 22, 2023:

Hello @pdevine, this is great. Is there a way to support .txt or .csv files with the same syntax in Ollama?

```
(base) igor@macIgor ~ % ollama run llama2-uncensored
>>> describe this text: /Users/igor/song.txt
```

Does the ability to read a file live in LLaVA or in Ollama?

@pdevine (Contributor, author) commented on Nov 23, 2023:

@igorschlum That's a really great idea. Reading the image file happens in Ollama's REPL, and the contents are then passed to the model runner. It would have to work differently for embeddings or prompt stuffing. I'll need to think about how that could work.

@suavelizard commented on Nov 24, 2023:

Is there a way to test this via the REST API? Not sure how to get the image data through:

```
{
    "model": "pdevine/llava-1.5:13b",
    "prompt": "Describe this image {image_index}",
    "stream": false,
    "image_data": []  // Base64?
}
```

Edit: this worked, where `image_data` is an array of base64-encoded images (I only tested with one):

```
{
    "model": "pdevine/llava-1.5:13b",
    "prompt": "Describe this image",
    "stream": false,
    "image_data": ["<base64-encoded image>"]
}
```
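For anyone trying this from code, here is a minimal Go sketch of the request above. It assumes the `image_data` field name from this comment and Ollama's default local address; the merged API may use a different field name, so check the current API docs:

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Read an image and base64-encode it (the path is illustrative).
	raw, err := os.ReadFile("cat.png")
	if err != nil {
		panic(err)
	}

	payload := map[string]any{
		"model":      "pdevine/llava-1.5:13b",
		"prompt":     "Describe this image",
		"stream":     false,
		"image_data": []string{base64.StdEncoding.EncodeToString(raw)},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}

	// POST to the local Ollama server's generate endpoint.
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out["response"])
}
```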

@igorschlum commented:

Hi @pdevine, I still think it would be beneficial to provide a path to a JSON file for Ollama to analyze. When I attempt the `$(cat ...)` approach with a path to a JSON file containing quotes, I run into shell quoting trouble (see below):

```
$ ollama run llama2 --verbose "$(cat /Users/igor/fr.json" Please translate in spanish this json preserving the json format and do not change the name of each key, first item of each line of the json
dquote cmdsubst cmdsubst dquote>
```

@pdevine (Contributor, author) commented on Nov 27, 2023:

@igorschlum Definitely agree, but that can be a follow-up PR. There are lots of things to consider with that change, as I was alluding to before.

@igorschlum replied:

@pdevine I understand.

@jmorganca (Member) left a review:

Small nit-level comments, but this is looking great!

@pdevine force-pushed the multimodal branch 2 times, most recently from 9a2ac73 to 3090d08 (December 5, 2023)
@pdevine force-pushed the multimodal branch 2 times, most recently from 4e14f3c to 1c4fdb2 (December 5, 2023)
@pdevine marked this pull request as ready for review (December 5, 2023)
Review thread on the Modelfile parameter docs:

```diff
@@ -150,6 +150,7 @@ PARAMETER <parameter> <parametervalue>
| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
```

@jmorganca (Member): Can be in a follow-up PR; we should add a section in api.md for this 😃

Review thread on the image content-type check:

```go
}

contentType := http.DetectContentType(data)
allowedTypes := []string{"image/jpeg", "image/jpg", "image/svg+xml", "image/png"}
```

Contributor: Is this defined by the model? Can it recognize other image types, e.g. BMPs or GIFs?

@mxyng (Contributor) commented on Dec 6, 2023:

The list of possible values returned by DetectContentType is very short, but it contains some values not seen here.

@pdevine (Contributor, author) replied:

Ideally there would be metadata from the model telling us the list of supported mimetypes. I made the list a few weeks ago from the llava documentation, but we can explore changing this in the future.
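To illustrate the point above, here is a standalone sketch of what `http.DetectContentType` actually returns. It sniffs at most the first 512 bytes against a short, fixed signature table, and some entries in the allow-list (notably `image/svg+xml` and `image/jpg`) are never produced by it:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// PNG magic bytes are enough for the sniffer to identify the type.
	png := []byte("\x89PNG\r\n\x1a\n")
	fmt.Println(http.DetectContentType(png)) // image/png

	// SVG is sniffed as generic XML, not image/svg+xml.
	svg := []byte(`<?xml version="1.0"?><svg xmlns="http://www.w3.org/2000/svg"></svg>`)
	fmt.Println(http.DetectContentType(svg)) // text/xml; charset=utf-8

	// JPEG is always reported as image/jpeg, never image/jpg.
	jpg := []byte("\xff\xd8\xff\xe0")
	fmt.Println(http.DetectContentType(jpg)) // image/jpeg
}
```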

Review thread on cmd/cmd.go (outdated):
```go
func extractFileNames(input string) (string, []ImageData, error) {
	// Regex to match file paths starting with / or ./ and include escaped spaces (\ or %20)
	// and followed by more characters and a file extension
	regexPattern := `(?:\./|/)[\S\\ ]+?\.(?:jpg|jpeg|png|svg)\b`
```
Contributor: Relying on file extensions is not ideal.

@igorschlum: I have a question about file extensions. If the model is able to describe a jpg, would it be hard to accept a path to an mp3 for a model that could convert sound to text?

@pdevine (Contributor, author): Certainly not ideal, but there's not really a great way I can think of to do this well. The Mac drag-and-drop into a text window just inserts the file name.

@igorschlum: I think it would be nice to give the path to a file in the prompt and have Ollama read the data at that path and send it to the model to handle. You showed that in the video you provided two weeks ago. I was wondering why just pictures, and not mp3 or JSON?

Contributor: @igorschlum Both of those could be possible in the future, but different types of files rely on having support in the library we use to run the LLM.

It also relies on the models themselves being designed and tuned for those use cases. At the moment, image-to-text models are easier for us to support.
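As an aside, here is a self-contained sketch of how the extension-based pattern quoted above behaves, using the same regex from the diff (the surrounding code is hypothetical, not the PR's implementation):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The pattern from the diff: paths starting with / or ./, allowing
// escaped spaces, ending in a known image extension.
var imagePathRe = regexp.MustCompile(`(?:\./|/)[\S\\ ]+?\.(?:jpg|jpeg|png|svg)\b`)

func main() {
	input := `describe this image ./photos/cat\ 1.png please`

	for _, p := range imagePathRe.FindAllString(input, -1) {
		// Undo shell-style escaped spaces before opening the file.
		fmt.Println("path:", strings.ReplaceAll(p, `\ `, " "))
	}

	// The prompt with the matched paths stripped out.
	fmt.Println("prompt:", imagePathRe.ReplaceAllString(input, ""))
}
```

This also shows why extension matching is fragile: a path with no extension, or a .gif, would silently be left in the prompt.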

Review thread on cmd/cmd.go:

```go
	ctx = context.WithValue(ctx, generateContextKey("context"), []int{})
	cmd.SetContext(ctx)
}
if len(opts.Images) == 0 {
```
Contributor: I don't know if this is a good idea. At its core, a llava model is still a language model, and it's possible to interact with it as just a text-completion or chat model.

@jmorganca (Member) commented on Dec 6, 2023:

Was thinking this too! It works very well sans image, and modern LLMs seem to blend both together so the user can do either with a single model.

@igorschlum commented on Dec 7, 2023:

I thought it would be transparent to the LLM, since the text from the file at the path would be transmitted to the LLM. I see in the code that the image is transmitted in Base64, so that's why you don't want to use this syntax for XML, CSV, JSON, or txt files. For mp3, Base64 is also used.

I'm dreaming of being able to use Whisper with Ollama and provide an mp3 for conversion:
https://huggingface.co/openai/whisper-large-v3

For txt or JSON, I know I can type this in the terminal:

```
ollama run llama2 --verbose please translate in spanish "$(cat /Users/igor/song.txt)"
```

but I'm forced to go back to the terminal to prompt, and that's not simple for non-Unix users.

When I use this command inside Ollama, I get the wrong result:

```
ollama run llama2
>>> please translate in spanish "$(cat /Users/igor/song.txt)"
```

The model replies:

> The command $(cat /Users/igor/song.txt) is a Unix or Linux command that uses the cat command to display the contents of a file located at /Users/igor/song.txt.

It would be nice to be able to prompt:

```
>>> please translate in spanish this text /Users/igor/song.txt
```

and drag and drop the txt, JSON, or CSV file onto the terminal window. Usage would be much simpler, like in the video above for jpg.

Anyway, you're doing a super job, and I really enjoy Ollama.

@pdevine (Contributor, author) replied:

I'm extremely reluctant to take this out, because the llava model is pretty much useless until an image is added. I get that you can get it to answer a question, but that feels like a degenerate use case for the current model. The kicker for me is that there's no indication to the user that they can even add an image.

@igorschlum replied:

LLaVA is better than Llama 2 at understanding the context of text and images and using that information to answer questions or generate text.

If my sample.txt file is:

```
C'est l'histoire d'un petit chat qui chantait "miaou miaou" tout le temps.
And translate it also in Italian
```

and my prompt is:

```
ollama run llama2 --verbose please translate in spanish "$(cat /Users/igor/sample.txt)"
```

the answer will be:

```
Spanish: "Es la historia de un pequeño gato que cantaba 'miau miau' todo el tiempo."
Italian: "Questa è la storia di un gattino che cantava 'miao miao' sempre."
```

The text in sample.txt is sent to Ollama and interpreted as part of the prompt, not as a file to be processed. By separating the file contents from the prompt, LLaVA could focus on processing the text content independently, leading to more efficient and accurate responses. This would be particularly useful for tasks that require extensive text processing, such as translation or summarization.

Review thread on cmd/cmd.go:

```go
}
defer file.Close()

buf := make([]byte, 512)
```
Contributor: `bytes.Buffer` is probably more appropriate. You can do something like this, where the rest of the file is appended after the initial 512 bytes:

```go
var b bytes.Buffer

// Read just the first 512 bytes for content-type sniffing
// (io.EOF here only means the file is shorter than that).
if _, err := io.CopyN(&b, file, 512); err != nil {
  // return err
}

contentType := http.DetectContentType(b.Bytes())
if !slices.Contains(types, contentType) {
  // return err
}

// Append the remainder of the file to the same buffer.
if _, err := io.Copy(&b, file); err != nil {
  // return err
}
```

@pdevine (Contributor, author) replied:

We still have to stat the file and only read it if it's < 100MB. I feel like we're splitting hairs here.
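Putting those pieces together, here is a minimal sketch of the flow being discussed: stat first, enforce a size cap, sniff the first 512 bytes, then read the rest. The 100MB cap matches the comment above, but the helper name and allow-list are illustrative, not the PR's exact code (assumes the usual imports: bytes, fmt, io, net/http, os, slices):

```go
func readImageFile(path string) ([]byte, error) {
	// Stat before reading so oversized files can be refused cheaply.
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	if info.Size() > 100<<20 { // 100MB cap
		return nil, fmt.Errorf("%s is too large to use as an image", path)
	}

	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	// Sniff the content type from at most the first 512 bytes.
	var b bytes.Buffer
	if _, err := io.CopyN(&b, file, 512); err != nil && err != io.EOF {
		return nil, err
	}
	contentType := http.DetectContentType(b.Bytes())
	if !slices.Contains([]string{"image/jpeg", "image/png"}, contentType) {
		return nil, fmt.Errorf("unsupported content type %q", contentType)
	}

	// Append the rest of the file to the same buffer.
	if _, err := io.Copy(&b, file); err != nil {
		return nil, err
	}
	return b.Bytes(), nil
}
```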

@BruceMacD (Contributor) left a review:

ship

Review thread on the interactive prompt handling:

```diff
@@ -902,6 +934,26 @@ func generateInteractive(cmd *cobra.Command, opts generateOptions) error {

if len(prompt) > 0 && multiline == MultilineNone {
```

Reviewer: I think you could abstract these if conditions into a function so they're easier to test. Nice work, btw.

@pdevine merged commit 910e940 into main on Dec 11, 2023, and deleted the multimodal branch.
emsi pushed a commit to emsi/ollama that referenced this pull request on Dec 13, 2023.

Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>