Adding vision support to api_like_OAI #524
Merged
This PR tries to address issue 258 (Support OpenAI Vision API).
The documentation (and, more formally, the code) shows that the `content` value of a message can be either a string (when dealing with text only) or a list of text/image "content parts". Usually one text part is provided, possibly together with more than one image, each passed as a URL or as a data URL holding both its format and its base64-encoded content.
Llama.cpp expects all text inside the `prompt` field, and visual content in an `image_data` list where each element is a (`data`, `id`) pair: `data` holds the base64-encoded image (just the plain content, no data URL) while `id` is a numerical id that can be referred to in the text as `[img-<id>]` (see example here).
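For concreteness, the mapping between the two request shapes looks roughly like this (a sketch only, with Python literals standing in for the JSON bodies and truncated base64 placeholders):

```python
# OpenAI-style chat message: "content" is a list of text/image parts,
# with images passed as data URLs.
openai_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the content of [img-1]?"},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
    ],
}

# Corresponding llama.cpp /completion payload: all text goes into "prompt",
# images go into "image_data" as (data, id) pairs referenced as [img-<id>].
llama_cpp_body = {
    "prompt": "What is the content of [img-1]?",
    "image_data": [{"data": "iVBORw0KGgo...", "id": 1}],
}
```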
What this code does
If `content` is a string, fall back to the previous behavior. Otherwise, append all the text items to the prompt and all the images to an `image_data` list of dictionaries, where each element contains a base64-encoded image and an id (from 1 up); a simplified sketch of this mapping follows below.
Unlike the llama.cpp web UI (which only accepts one image, with default id=10), more than one image can be uploaded. Experiments with llava-v1.5-7b-Q4_K.gguf show that things work properly when one image is uploaded; when more than one is present, however, results are mixed (see examples below).
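A minimal sketch of that conversion, not the exact diff (the function name and the data-URL handling here are simplified for illustration):

```python
import re

def convert_content(content):
    """Map an OpenAI-style message "content" field to a (prompt, image_data)
    pair in the shape expected by llama.cpp's /completion endpoint."""
    if isinstance(content, str):
        # Plain string: keep the previous text-only behavior.
        return content, []

    prompt_parts = []
    image_data = []
    for part in content:
        if part.get("type") == "text":
            prompt_parts.append(part["text"])
        elif part.get("type") == "image_url":
            url = part["image_url"]["url"]
            # Only data URLs are handled here; strip the
            # "data:image/...;base64," header so image_data carries
            # just the raw base64 payload, with ids assigned from 1 up.
            match = re.match(r"data:image/[^;]+;base64,(.*)", url, re.DOTALL)
            if match:
                image_data.append({"data": match.group(1),
                                   "id": len(image_data) + 1})
    return " ".join(prompt_parts), image_data
```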
What this code does not do
I have not implemented image download from plain URLs, mainly because I was worried about the security implications (i.e., potentially allowing third parties to make the host running the API download arbitrary content), but I am happy to go ahead and add it if there are no big concerns.
Example
I tested the following code with llava-v1.5-7b-Q4_K.gguf and the two images attached.
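For context, a vision request against the adapter has roughly the shape sketched below; this is an illustration rather than the exact test script, and it assumes the default 127.0.0.1:8081 address of api_like_OAI and a local triangle.png (adjust host, port, and image path to your setup):

```python
import base64
import requests

# Hypothetical client call against the api_like_OAI adapter.
with open("triangle.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:8081/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the content of [img-1]?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```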
Experiments with a single image (triangle):
Experiments with two images (triangle, circle):
Query: What is the content of [img-2]?
Answer: The image shows a small, bright green ball, possibly a plastic ball, with a yellowish tint. The ball is round and has a slightly shiny appearance, giving it a unique and vibrant look.
Query: What are the main differences between [img-1] and [img-2]?
Answer: The image features a large, green, circular object, likely a disc or a round item, with a red outline. It appears to be a close-up, which highlights the green color and the contrasting red background. The green object has a shiny, reflective quality, giving it a clean and polished look. 🤔
Query: What is the content of [img-1]?
Answer: The image depicts a green circle or disc with a red triangle on top of it. The green circle is larger than the red triangle, which is situated in the center of the green circle. The red triangle adds a contrasting element to the scene, while the green circle dominates the visual focus. The entire image appears to be quite colorful and vibrant, making it an eye-catching and lively display. 🤔 🤔