Adding vision support to api_like_OAI #524
Merged
This PR tries to address issue 258 (Support OpenAI Vision API).
The documentation (and, more formally, the code) shows that the `content` value of a message can be either a string (when dealing with text only) or a list of text/image "content parts". Usually one text part is provided, possibly together with more than one image, each passed as a URL or as a data URL holding both its format and its base64-encoded content.
Llama.cpp expects all text inside the `prompt` field, and visual content in an `image_data` list where each element is a (`data`, `id`) pair: `data` holds the base64-encoded image (just the plain content, no data URL) while `id` is a numerical id that can be referred to in the text as `[img-<id>]` (see example here).
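For concreteness, the mapping between the two request shapes looks roughly like this (a sketch only, with Python literals standing in for the JSON bodies and truncated base64 placeholders):

```python
# OpenAI-style chat message: "content" is a list of text/image parts,
# with images passed as data URLs.
openai_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the content of [img-1]?"},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
    ],
}

# Corresponding llama.cpp /completion payload: all text goes into "prompt",
# images go into "image_data" as (data, id) pairs referenced as [img-<id>].
llama_cpp_body = {
    "prompt": "What is the content of [img-1]?",
    "image_data": [{"data": "iVBORw0KGgo...", "id": 1}],
}
```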
What this code does
If `content` is a string, fall back to the previous behavior. Otherwise, append all the text items to the prompt and all the images to an `image_data` list of dictionaries, where each element contains a base64-encoded image and an id (from 1 up); a simplified sketch of this mapping follows below.
Unlike the llama.cpp web UI (which only accepts one image, with default id=10), more than one image can be uploaded. Experiments with llava-v1.5-7b-Q4_K.gguf show that things work properly when one image is uploaded; when more than one is present, however, results are mixed (see examples below).
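A minimal sketch of that conversion, not the exact diff (the function name and the data-URL handling here are simplified for illustration):

```python
import re

def convert_content(content):
    """Map an OpenAI-style message "content" field to a (prompt, image_data)
    pair in the shape expected by llama.cpp's /completion endpoint."""
    if isinstance(content, str):
        # Plain string: keep the previous text-only behavior.
        return content, []

    prompt_parts = []
    image_data = []
    for part in content:
        if part.get("type") == "text":
            prompt_parts.append(part["text"])
        elif part.get("type") == "image_url":
            url = part["image_url"]["url"]
            # Only data URLs are handled here; strip the
            # "data:image/...;base64," header so image_data carries
            # just the raw base64 payload, with ids assigned from 1 up.
            match = re.match(r"data:image/[^;]+;base64,(.*)", url, re.DOTALL)
            if match:
                image_data.append({"data": match.group(1),
                                   "id": len(image_data) + 1})
    return " ".join(prompt_parts), image_data
```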
What this code does not do
I have not implemented image download from plain URLs, mainly because I was worried about the security implications (i.e., potentially allowing third parties to make the host running the API download arbitrary content), but I am happy to go ahead and add it if there are no big concerns.
Example
I tested the following code with llava-v1.5-7b-Q4_K.gguf and the two images attached.
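For context, a vision request against the adapter has roughly the shape sketched below; this is an illustration rather than the exact test script, and it assumes the default 127.0.0.1:8081 address of api_like_OAI and a local triangle.png (adjust host, port, and image path to your setup):

```python
import base64
import requests

# Hypothetical client call against the api_like_OAI adapter.
with open("triangle.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:8081/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the content of [img-1]?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```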
Experiments with a single image (triangle):
Experiments with two images (triangle, circle):
Query: What is the content of [img-2]?
Answer: The image shows a small, bright green ball, possibly a plastic ball, with a yellowish tint. The ball is round and has a slightly shiny appearance, giving it a unique and vibrant look.
Query: What are the main differences between [img-1] and [img-2]?
Answer: The image features a large, green, circular object, likely a disc or a round item, with a red outline. It appears to be a close-up, which highlights the green color and the contrasting red background. The green object has a shiny, reflective quality, giving it a clean and polished look. 🤔
Query: What is the content of [img-1]?
Answer: The image depicts a green circle or disc with a red triangle on top of it. The green circle is larger than the red triangle, which is situated in the center of the green circle. The red triangle adds a contrasting element to the scene, while the green circle dominates the visual focus. The entire image appears to be quite colorful and vibrant, making it an eye-catching and lively display. 🤔 🤔