Multimodal support #1216
Conversation
Here's a demo of multi-modal in action: multimodal3.mov
To try it out, you can use the …
Hello @pdevine, this is great. Is there a way to have support for .txt or .csv files with the same syntax in Ollama? (base) igor@macIgor ~ % ollama run llama2-uncensored
Is the ability to read a file in LLaVA or in Ollama?
@igorschlum That's a really great idea. Reading the image file happens in the REPL in Ollama, and the data is then passed to the model runner. It would have to work differently for embeddings or prompt stuffing. I'll need to think about how that could work.
Is there a way to test this via the REST API? Not sure how to get the image data through:
{
  "model": "pdevine/llava-1.5:13b",
  "prompt": "Describe this image {image_index}",
  "stream": false,
  "image_data": [] // Base64?
}
Edit: this worked, where image_data is a list of Base64-encoded images:
{
  "model": "pdevine/llava-1.5:13b",
  "prompt": "Describe this image",
  "stream": false,
  "image_data": []
}
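A request body like the one above could also be assembled programmatically. Below is a minimal Go sketch, assuming the field names from the example JSON (the struct and the helper `buildRequest` are hypothetical, not the PR's code):

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// generateRequest mirrors the JSON body from the comment above; the
// field names are taken from that example and may not match the final API.
type generateRequest struct {
	Model     string   `json:"model"`
	Prompt    string   `json:"prompt"`
	Stream    bool     `json:"stream"`
	ImageData []string `json:"image_data"`
}

// buildRequest base64-encodes the raw image bytes and marshals the body.
func buildRequest(model, prompt string, image []byte) ([]byte, error) {
	req := generateRequest{
		Model:     model,
		Prompt:    prompt,
		Stream:    false,
		ImageData: []string{base64.StdEncoding.EncodeToString(image)},
	}
	return json.Marshal(req)
}

func main() {
	// A few stand-in bytes in place of a real image.
	body, err := buildRequest("pdevine/llava-1.5:13b", "Describe this image", []byte{0xff, 0xd8, 0xff})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// The resulting body would then be POSTed to the generate endpoint.
}
```
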
Hi @pdevine, I still think it would be beneficial to provide a path to a JSON file to be analyzed by Ollama. When I attempt to use the $(cat method, I encounter an issue when providing a path to a JSON file containing quotes. (See below.)
$ ollama run llama2 --verbose "$(cat /Users/igor/fr.json)" Please translate in Spanish this json preserving the json format and do not change the name of each key, first item of each line of the json
@igorschlum Definitely agree, but that can be a follow-up PR. There are lots of things to consider with that change, as I was alluding to before.
@pdevine I understand. |
Small nit-level comments but this is looking great!
@@ -150,6 +150,7 @@ PARAMETER <parameter> <parametervalue>
| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
Can be in a follow-up PR, but we should add a section in api.md for this 😃
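For context, a parameter like the one documented in the diff is set in a Modelfile with the PARAMETER syntax shown there; a minimal sketch:

```
FROM llama2
PARAMETER top_k 40
```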
}
contentType := http.DetectContentType(data)
allowedTypes := []string{"image/jpeg", "image/jpg", "image/svg+xml", "image/png"}
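As a quick sanity check of the allow-list above: `http.DetectContentType` sniffs at most the first 512 bytes of the data. A small self-contained sketch (the helper `isAllowedImage` is hypothetical):

```go
package main

import (
	"fmt"
	"net/http"
	"slices"
)

// allowedTypes matches the allow-list in the diff above.
var allowedTypes = []string{"image/jpeg", "image/jpg", "image/svg+xml", "image/png"}

// isAllowedImage sniffs the content type from the leading bytes of data
// and reports whether it is on the allow-list.
func isAllowedImage(data []byte) (string, bool) {
	ct := http.DetectContentType(data)
	return ct, slices.Contains(allowedTypes, ct)
}

func main() {
	// The PNG magic number is enough for DetectContentType to identify it.
	ct, ok := isAllowedImage([]byte("\x89PNG\r\n\x1a\n"))
	fmt.Println(ct, ok) // image/png true
}
```

Note that `DetectContentType` never returns "image/jpg" (only "image/jpeg"), so that entry in the allow-list is effectively dead.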
Is this defined by the model? Can it recognize other image types, e.g. bmp, gif?
The list of possible values returned by DetectContentType is very short, but it contains some values not seen here.
Ideally there would be metadata from the model which told us the list of supported mimetypes. I made the list a few weeks ago from the llava documentation, but we can explore changing this in the future.
cmd/cmd.go
func extractFileNames(input string) (string, []ImageData, error) {
	// Regex to match file paths starting with / or ./, including escaped
	// spaces (\ or %20), followed by more characters and a file extension
	regexPattern := `(?:\./|/)[\S\\ ]+?\.(?:jpg|jpeg|png|svg)\b`
Relying on file extensions is not ideal
I have a question about file extensions. If the model is able to describe a jpg, would it be difficult to accept a path to an mp3 for a model that could convert the sound to text?
Certainly not ideal, but there's not really a great way that I can think of to do this well. The mac drag-and-drop into a text window just inserts the file name.
I think it could be nice to give the path to a file in the prompt, and Ollama could read the data at that path and send it to the model to be handled. In the video you provided 2 weeks ago, you show exactly that. I was wondering why just pictures and not mp3 or json.
@igorschlum both of those could be possible in the future, but different types of files rely on having support in the library we use to run the LLM.
It also relies on the models themselves to be designed and tuned for those use-cases too. At the moment image-to-text models are easier for us to support.
ctx = context.WithValue(ctx, generateContextKey("context"), []int{})
cmd.SetContext(ctx)
}
if len(opts.Images) == 0 {
I don't know if this is a good idea. At its core, a llava model is still a language model and it's possible to interact with it as just a text completion or chat model
Was thinking this too! It works very well sans-image, and modern LLMs seem to blend both together so the user can do either with a single model
I thought it would be transparent to the LLM, as the text from the file at the path would just be transmitted to the LLM. I see in the code that the image is transmitted in Base64, so that's why you don't want to use this syntax for XML, csv, json or txt files. For mp3, Base64 is also used.
I'm dreaming of being able to somehow use Whisper with Ollama and provide an mp3 for conversion.
https://huggingface.co/openai/whisper-large-v3
For txt or json, I know I can type in the terminal:
ollama run llama2 --verbose please translate in Spanish "$(cat /Users/igor/song.txt)"
but I'm obliged to go back to the terminal to prompt, and that's not simple for non-Unix users.
When I use this command inside Ollama, I get an error:
ollama run llama2
>>> please translate in Spanish "$(cat /Users/igor/song.txt)"
The command $(cat /Users/igor/song.txt) is a Unix or Linux command that uses the cat command to display the contents of a file located at /Users/igor/song.txt.
It would be nice to be able to prompt:
>>> please translate in Spanish this text /Users/igor/song.txt
I would drag and drop the txt, json or csv file onto the terminal window. Usage would be much simpler, as you show in the video above for jpg.
Anyway, you're doing a super job, and I really enjoy Ollama and seeing you improve it.
I'm extremely reluctant to take this out because with the llava model it's pretty much useless until an image is added. I get that you can get it to answer a question, but that feels like a degenerate use case for the current model. The kicker for me is that there is no indication to the user that they can even add an image.
LLaVA is better at understanding the context of text and images and using that information to answer questions or generate text than Llama2.
If my sample.txt file is:
C'est l'histoire d'un petit chat qui chantait "miaou miaou" tout le temps.
and my prompt is:
ollama run llama2 --verbose please translate in Spanish "$(cat /Users/igor/sample.txt)" And translate it also in Italian
the answer will be:
Spanish: "Es la historia de un pequeño gato que cantaba 'miau miau' todo el tiempo."
Italian: "Questa è la storia di un gattino che cantava 'miao miao' sempre."
The text in the sample.txt file is sent to Ollama and is interpreted as part of the prompt, not as a file that must be processed.
By separating the text from the prompt, LLaVA can focus on processing the text content independently, leading to more efficient and accurate responses. This approach could be particularly useful for tasks that require extensive text processing, such as translation or summarization.
}
defer file.Close()

buf := make([]byte, 512)
bytes.Buffer is probably more appropriate. You can do something like this, where it just appends after the first 512 bytes:
var b bytes.Buffer
if _, err := io.CopyN(&b, file, 512); err != nil && !errors.Is(err, io.EOF) {
	// return err
}
contentType := http.DetectContentType(b.Bytes())
if !slices.Contains(types, contentType) {
	// return err
}
if _, err := io.Copy(&b, file); err != nil {
	// return err
}
We still have to stat the file and only read if the file is < 100MB. I feel like we're splitting hairs here.
ship
@@ -902,6 +934,26 @@ func generateInteractive(cmd *cobra.Command, opts generateOptions) error {

if len(prompt) > 0 && multiline == MultilineNone {
I think you can abstract these if conditions into a function so they're easy to test. Nice work btw.
--------- Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>
This PR builds off of @mattapperson's work, but with a more ollama-like UX + API.