# Lesson 2 Project: Image Analysis with GPT-4 Vision

## Introduction

Welcome to the second lesson on multimodal AI! Today, you're diving into image analysis with GPT-4 Vision: an AI system that can look at an image, describe it in detail, and answer questions about its content. In this lesson, you'll explore the fundamentals of how GPT-4 Vision works, its applications, and its current limitations. You'll gain hands-on experience preparing images for analysis, making API requests, and interpreting the API responses.

By the end of this lesson, you will be able to:

- Implement image analysis using GPT-4 with vision capabilities
- Process and prepare images for API requests
- Interpret and utilize the AI's analysis of image content

These skills will deepen your understanding of multimodal AI and give you practical experience you can apply to image-based tasks in your own work.

## Setting Up the OpenAI Development Environment

Refer to the Python Crash Course lesson to learn how to set up your OpenAI development environment.

In [1]:
# Install dependencies
!pip install openai pydantic python-dotenv Pillow matplotlib

In [2]:
# Load the OpenAI library
from openai import OpenAI

# Set up relevant environment variables
from dotenv import load_dotenv

load_dotenv()

# Create the OpenAI connection object
client = OpenAI()
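
`OpenAI()` reads the API key from the `OPENAI_API_KEY` environment variable, so a missing or misnamed entry in `.env` typically shows up as an error from the client. A minimal sketch of an earlier, more explicit check follows; `require_env` is a hypothetical helper, not part of the OpenAI library or this lesson's code.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would fail fast with a readable message before any request is made.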

## Making a GPT-4 Vision API Request

### Using Image URLs

Before passing an image to GPT-4 Vision, download and view it locally to confirm that it is suitable for analysis. The following code downloads an image from a URL and displays it:

In [3]:
# Show images in Jupyter Lab

# Import necessary libraries
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

# Set image URL
ramen_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Shoyu_ramen%2C_at_Kasukabe_Station_%282014.05.05%29_1.jpg/1280px-Shoyu_ramen%2C_at_Kasukabe_Station_%282014.05.05%29_1.jpg"

# Fetch the image from the URL
headers = {
    'User-Agent': 'User Agent 1.0'
}
response = requests.get(ramen_image_url, headers=headers, timeout=30)
response.raise_for_status()  # Stop early if the download failed
img = Image.open(BytesIO(response.content))

# Display the image
plt.figure(figsize=(12, 8))
plt.imshow(img)
plt.axis('off')
plt.show()
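
Viewing the image is a manual check. A programmatic complement is to inspect the first bytes of the download, which catches cases such as a server returning an HTML error page instead of an image. The sketch below uses a hypothetical helper, `sniff_image_format`, and recognizes PNG, JPEG, GIF, and WebP by their file signatures. Treating exactly these four as the accepted formats is an assumption based on OpenAI's documentation and worth re-checking.

```python
from typing import Optional


def sniff_image_format(data: bytes) -> Optional[str]:
    """Guess the image format from the file signature, or return None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    # WebP files are RIFF containers with "WEBP" at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None
```

With the download above, `sniff_image_format(response.content)` should return `"jpeg"` for the ramen photo if the request succeeded.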

Now, let's make an API request to GPT-4 Vision:

In [4]:
# Use an image URL when analyzing an image with GPT-4 Vision

# Text prompt
prompt = "How many calories are in this food?"

# Model
openai_model = "gpt-4o"

# Creating an API request
response = client.chat.completions.create(
  model=openai_model,
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": prompt},
        {
          "type": "image_url",
          "image_url": {
            "url": ramen_image_url,
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

choice = response.choices[0]
print(choice)

# Extract the content
print(choice.message.content)
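
Because the request sets `max_tokens=300`, a long answer can be cut off mid-sentence. Each choice carries a `finish_reason` field, and the value `"length"` indicates that the token limit was reached. The sketch below shows one way to surface that; `extract_text` is a hypothetical helper, and `fake_choice` is a stand-in object with made-up content so the example runs without an API call.

```python
from types import SimpleNamespace


def extract_text(choice) -> str:
    """Return the reply text, flagging replies cut off by the token limit."""
    text = choice.message.content or ""
    if choice.finish_reason == "length":
        text += "\n[truncated: raise max_tokens to see the full answer]"
    return text


# Stand-in with the same attributes as a real choice (hypothetical data)
fake_choice = SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(content="A bowl of shoyu ramen typically has"),
)
print(extract_text(fake_choice))
```

In the notebook you would pass the real object instead: `print(extract_text(response.choices[0]))`.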

### Uploading Base64 Encoded Images

If an image is stored locally rather than at a public URL, you can send it as a base64-encoded data URL instead:

In [5]:
# Convert an image to a base64 encoded image
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "images/fried_rice.png"
base64_image = encode_image(image_path)

# Show the image in Jupyter Lab
img = Image.open(image_path)

plt.imshow(img)
plt.axis('off')
plt.show()

# Upload the base64 encoded image to the OpenAI API server

# Text prompt
prompt = "How many calories are in this food?"

# Model
openai_model = "gpt-4o"

# Creating an API request
response = client.chat.completions.create(
  model=openai_model,
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": prompt},
        {
          "type": "image_url",
          "image_url": {
            # The MIME type should match the file format (fried_rice.png is a PNG)
            "url": f"data:image/png;base64,{base64_image}",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

choice = response.choices[0]
print(choice)

# Extract the content
print(choice.message.content)
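
The prefix of a data URL names the image's format, so it should agree with the file being sent (here, a `.png` file). Rather than hard-coding the prefix, it can be derived from the file extension. The following is a sketch using only the standard library; `to_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Encode a local image as a data URL whose MIME type matches its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The result can be passed directly as the `"url"` value in the request, for example `to_data_url("images/fried_rice.png")`. Note that this trusts the file extension and does not inspect the file contents.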

### Using More Than One Image

You can include several images in a single request, mixing URLs and base64-encoded data, and ask one question about all of them. For example, you can compare the food shown in two images.

In [6]:
# Creating an API request consisting of two images

# Text prompt
prompt = "Which food has fewer calories?"

# Model
openai_model = "gpt-4o"

# Creating an API request
response = client.chat.completions.create(
  model=openai_model,
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": prompt},
        {
          "type": "image_url",
          "image_url": {
            # The MIME type should match the file format (fried_rice.png is a PNG)
            "url": f"data:image/png;base64,{base64_image}",
          },
        },
        {
          "type": "image_url",
          "image_url": {
            "url": ramen_image_url,
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

choice = response.choices[0]
print(choice)
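
As the number of images grows, writing each `image_url` entry by hand becomes repetitive. The sketch below builds the same message structure from a list; `build_image_message` is a hypothetical helper, and each entry in `image_urls` may be either a web URL or a base64 data URL.

```python
def build_image_message(prompt: str, image_urls: list[str]) -> dict:
    """Build one user message: a text part followed by one part per image."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}
```

The returned dictionary can be used as the single entry in the request's `messages` list, for example `messages=[build_image_message(prompt, [ramen_image_url])]`.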

### Controlling Image Fidelity

GPT-4 Vision allows you to control the fidelity of image processing using the detail parameter. There are three options:

- "low": Faster processing, lower detail
- "high": Slower processing, higher detail
- "auto": The model decides based on the image size

Here's how to use the detail parameter:

In [7]:
# Use the detail parameter when analyzing an image with GPT-4 Vision

# Text prompt
prompt = "How many calories are in this food?"

# Model
openai_model = "gpt-4o"

# Creating an API request
response = client.chat.completions.create(
  model=openai_model,
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": prompt},
        {
          "type": "image_url",
          "image_url": {
            "url": ramen_image_url,
            "detail": "low"
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

choice = response.choices[0]
print(choice.message.content)
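
One way to see what the detail setting costs for your own images is to send the same request at each level and compare the prompt token counts reported in `response.usage`. The following is a sketch under that assumption; `compare_detail_levels` is a hypothetical helper, and running it against the real client makes one billable API call per level.

```python
def compare_detail_levels(client, model, prompt, image_url, levels=("low", "high")):
    """Send the same request once per detail level and collect prompt token counts."""
    usage = {}
    for level in levels:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {"url": image_url, "detail": level},
                        },
                    ],
                }
            ],
            max_tokens=300,
        )
        # prompt_tokens covers the text prompt plus the image input
        usage[level] = response.usage.prompt_tokens
    return usage
```

In the notebook, `compare_detail_levels(client, openai_model, prompt, ramen_image_url)` would return a dictionary keyed by detail level.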

## Interpreting and Utilizing Results

When interpreting the results from GPT-4 Vision, keep in mind its capabilities and limitations:

- The model excels at general descriptions and identifying objects in images.
- It can understand relationships between objects but may struggle with precise spatial reasoning.
- The model may provide approximate counts for objects in images.
- It may have difficulty with very small text or highly specialized images (e.g., medical scans).

Here's an example of how to extract specific information from the model's response:

In [8]:
# Extract specific information from GPT-4 Vision's image analysis

from pydantic import BaseModel

class FoodCalories(BaseModel):
    total_calories: str
    analysis: str

# Use JSON format to make extracting information easier

# Text prompt
prompt = "How many calories are in this food?"

# Model
openai_model = "gpt-4o-2024-08-06"

# Creating an API request
response = client.beta.chat.completions.parse(
  model=openai_model,
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": prompt},
        {
          "type": "image_url",
          "image_url": {
            "url": ramen_image_url,
            "detail": "low"
          },
        },
      ],
    }
  ],
  response_format=FoodCalories,
  max_tokens=300,
)

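choice = response.choices[0]
print(choice.message.content)

# Access the validated FoodCalories object directly
food = choice.message.parsed
print(food.total_calories)
print(food.analysis)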
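
With structured outputs, the message returned by `parse` also exposes a `parsed` attribute holding the validated object and a `refusal` attribute that is set when the model declines to answer. The sketch below checks both before using the result. `get_parsed_or_raise` is a hypothetical helper, and `fake_choice` is a stand-in with placeholder values, not real model output.

```python
from types import SimpleNamespace


def get_parsed_or_raise(choice):
    """Return the parsed object, or raise if the model refused or returned nothing."""
    message = choice.message
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise ValueError(f"Model refused the request: {refusal}")
    if message.parsed is None:
        raise ValueError("No parsed output; check finish_reason and max_tokens")
    return message.parsed


# Stand-in for a real response choice (hypothetical values, not model output)
fake_choice = SimpleNamespace(
    message=SimpleNamespace(
        refusal=None,
        parsed=SimpleNamespace(total_calories="unknown", analysis="placeholder"),
    )
)
print(get_parsed_or_raise(fake_choice).total_calories)
```

With a real response, `get_parsed_or_raise(response.choices[0])` would return a `FoodCalories` instance whose fields can be read as ordinary attributes.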