Member

@aittalam aittalam commented Aug 3, 2024

This PR tries to address issue 258 (Support OpenAI Vision API).

The documentation (and, more formally, the code) shows that the content value of a message can be either a string (when dealing only with text) or a list of text/image "content parts". Usually one text part is provided together with one or more images, each passed either as a URL or as a data URL holding both its format and its base64-encoded content.
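To make the two shapes concrete, here is a minimal sketch of a text-only message, a multi-part message, and a helper that splits a data URL into its format and its base64 payload. The helper name `split_data_url` and the sample payload are illustrative assumptions, not part of this PR.

```python
# Illustrative sketch only: split_data_url is a hypothetical helper, not code from this PR.

# Shape 1: content is a plain string (text only).
text_only_message = {"role": "user", "content": "Hello!"}

# Shape 2: content is a list of text/image "content parts".
multipart_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/gif;base64,R0lGOD"}},
    ],
}


def split_data_url(url):
    """Split a base64 data URL into (mime_type, base64_payload)."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    # Everything before the first comma is the header, the rest is the payload.
    header, sep, payload = url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URL has no payload")
    fields = header.split(";")
    if "base64" not in fields[1:]:
        raise ValueError("only base64-encoded data URLs are handled here")
    return fields[0], payload


image_part = multipart_message["content"][1]
mime_type, payload = split_data_url(image_part["image_url"]["url"])
```

Only the payload after the comma is the plain base64 content; the part before it carries the image format.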

Llama.cpp expects all text inside the prompt field, and visual content in an image_data list where each element is a (data,id) pair. data holds the base64-encoded image (just the plain content, no data URL) while id is a numerical id that can be referred to in the text as [img-<id>] (see example here).

What this code does

If content is a string, fall back to the previous behavior. Otherwise, append all the text items to the prompt and all the images to an image_data list of dictionaries, where each element contains a base64-encoded image and an id (counting up from 1).
Unlike the llama.cpp web UI (which only accepts one image, with default id=10), this allows more than one image to be uploaded. Experiments with llava-v1.5-7b-Q4_K.gguf show that things work properly when a single image is uploaded, while results are mixed when more than one is provided (see examples below).
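The behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual diff: the function name `content_to_llama` is hypothetical, text parts are simply concatenated, role prefixes and multi-message handling are left out, and images given as plain URLs are skipped because downloading is not implemented.

```python
# Simplified sketch; content_to_llama is a hypothetical name, not the PR's code.

def content_to_llama(content):
    """Map an OpenAI-style content value to (prompt_text, image_data)."""
    # Text-only messages keep the previous behavior: the string is the prompt.
    if isinstance(content, str):
        return content, []

    text_items = []
    image_data = []
    for part in content:
        kind = part.get("type")
        if kind == "text":
            text_items.append(part["text"])
        elif kind == "image_url":
            url = part["image_url"]["url"]
            # Only data URLs are handled; keep just the plain base64 payload.
            if url.startswith("data:") and "," in url:
                image_data.append({
                    "data": url.split(",", 1)[1],
                    "id": len(image_data) + 1,  # ids count up from 1
                })
    return "".join(text_items), image_data


prompt, images = content_to_llama([
    {"type": "text", "text": "What is the content of [img-1]?"},
    {"type": "image_url", "image_url": {"url": "data:image/gif;base64,AAAA"}},
    {"type": "image_url", "image_url": {"url": "data:image/gif;base64,BBBB"}},
])
```

With ids assigned in upload order, the text can refer to the first image as [img-1] and to the second as [img-2].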

What this code does not do

I have not implemented image download from URLs, mainly because I was worried about the security implications (i.e. potentially allowing third parties to trigger downloads on whatever machine the API runs on), but I am happy to go ahead and add it if there are no major concerns.

Example

[Attached images: circle, triangle]

I tested the following code with llava-v1.5-7b-Q4_K.gguf and the two images attached.

import base64
import requests
import os

# OpenAI API Key
api_key = os.environ["OPENAI_API_KEY"]

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "triangle.gif"
image_path2 = "circle.gif"

# Getting the base64 string
base64_image = encode_image(image_path)
base64_image2 = encode_image(image_path2)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

payload = {
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is the content of [img-1]?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/gif;base64,{base64_image}"
          }
        },
        # this part can be removed to upload only the first image
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/gif;base64,{base64_image2}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

#response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
response = requests.post("http://127.0.0.1:8081/v1/chat/completions", headers=headers, json=payload)

print(response.json())

Experiments with a single image (triangle):

  • Query: What is the content of [img-1]?
  • Answer: The image depicts a red triangle, which is a universally recognized symbol for warning or danger. It is displayed prominently in the center of the scene with a white background, making it stand out clearly. The triangle symbolizes caution and alertness, and is often found on road signs, safety equipment, and other warning systems to communicate important information to users or passersby.

Experiments with two images (triangle, circle):

  • Query: What is the content of [img-2]?

  • Answer: The image shows a small, bright green ball, possibly a plastic ball, with a yellowish tint. The ball is round and has a slightly shiny appearance, giving it a unique and vibrant look.

  • Query: What are the main differences between [img-1] and [img-2]?

  • Answer: The image features a large, green, circular object, likely a disc or a round item, with a red outline. It appears to be a close-up, which highlights the green color and the contrasting red background. The green object has a shiny, reflective quality, giving it a clean and polished look. 🤔

  • Query: What is the content of [img-1]?

  • Answer: The image depicts a green circle or disc with a red triangle on top of it. The green circle is larger than the red triangle, which is situated in the center of the green circle. The red triangle adds a contrasting element to the scene, while the green circle dominates the visual focus. The entire image appears to be quite colorful and vibrant, making it an eye-catching and lively display. 🤔 🤔

@aittalam aittalam changed the title from "Added vision support to api_like_OAI" to "Adding vision support to api_like_OAI" Aug 3, 2024
Collaborator

@jart jart left a comment

Thank you!
