# Classification of Prompts using GPT-4o

This script uses **GPT-4o** to classify the prompts (edit instructions) in the **validation split**, so that the correlation analysis can be broken down by category. The breakdown may reveal that **CLIP** handles some prompt categories better than others.

Analysis of results can be found in the file `correlation_vifi_human.ipynb`.

## Classification Categories

1. **Semantic Category**  
   - Object Addition  
   - Object Removal  
   - Object Replacement  
   - Color Changes  
   - Attribute Changes  
   - Scene Changes  
   - Other  

2. **Instruction Type**  
   - Imperative (Command)  
   - Interrogative (Question)  

3. **Visual Impact**  
   - Subtle Edit  
   - Moderate Edit  
   - Drastic Edit  

4. **Object Type**  
   - Animals  
   - People  
   - Furniture/Objects  
   - Vehicles  
   - Food  
   - Background/Scene  
   - Other  
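
The four label sets above can also be written down as data, which makes it possible to check a classification before it is stored. The following is a minimal sketch and not part of the original pipeline: the names `CATEGORIES`, `_short`, and `is_valid_classification` are hypothetical. Because the model may answer with a short form such as "Imperative" instead of "Imperative (Command)", the check ignores a trailing parenthetical.

```python
# Hypothetical helper: the label sets from the list above, in the order
# in which the classification is expected to be returned.
CATEGORIES = {
    "semantic_category": [
        "Object Addition", "Object Removal", "Object Replacement",
        "Color Changes", "Attribute Changes", "Scene Changes", "Other",
    ],
    "instruction_type": ["Imperative (Command)", "Interrogative (Question)"],
    "visual_impact": ["Subtle Edit", "Moderate Edit", "Drastic Edit"],
    "object_type": [
        "Animals", "People", "Furniture/Objects", "Vehicles",
        "Food", "Background/Scene", "Other",
    ],
}


def _short(label):
    """Drop a trailing parenthetical: 'Imperative (Command)' -> 'Imperative'."""
    return label.split(" (")[0].strip()


def is_valid_classification(labels):
    """Return True if `labels` holds one allowed label per aspect, in order."""
    if len(labels) != len(CATEGORIES):
        return False
    return all(
        _short(label) in {_short(allowed) for allowed in allowed_labels}
        for label, allowed_labels in zip(labels, CATEGORIES.values())
    )
```

A classification that fails this check could be logged and retried rather than written to the output file.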


----
The prompt follows OpenAI's [prompt instructions](https://platform.openai.com/docs/guides/vision) for requests that include visual content.

The OpenAI API key must be set as `OPENAI_API_KEY` in the `.env` file.
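
A `.env` file is a plain-text file of `KEY=value` lines that `python-dotenv` loads into the process environment. Since a missing key otherwise only surfaces at the first API call, a fail-fast check can be useful. The sketch below is an assumption about how such a check might look; the helper name `require_api_key` is hypothetical, and only the variable name `OPENAI_API_KEY` is taken from the code in this document.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from `env` (default: os.environ) or raise early."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add a line '{name}=<your key>' to the .env file"
        )
    return key
```

Calling `require_api_key()` right after `load_dotenv()` would stop the run before any images are encoded or sent.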

Install required packages.

In [7]:
!pip install opencv-python -q
!pip install torch torchvision -q
!pip install "torchmetrics[image]" -q
!pip install torch-fidelity -q
!pip install numpy pandas -q
!pip install openai -q
!pip install python-dotenv -q

In [1]:
import json
import pandas as pd
import re
import numpy as np
import cv2
import torch
import base64
import requests
import openai
import os
import ast

from PIL import Image
from torchvision import transforms
from torchmetrics.image.fid import FrechetInceptionDistance
from io import BytesIO
from openai import OpenAI
from dotenv import load_dotenv

In [3]:
load_dotenv()
# load_dotenv(dotenv_path="/home/jovyan/BA/Github/thesis-edit-evaluation/.env")

# specify model
MODEL = "gpt-4o-2024-08-06"
API_KEY = os.getenv("OPENAI_API_KEY")

In [21]:
system_prompt = """
You will be given instructions of examples within instruction-guided Image Editing. I want you to classify them based on four different aspects:

1. Semantic Category (What does the instruction do?): Choose from: Object Addition, Object Removal, Object Replacement, Color Changes, Attribute Changes, Scene Changes, Other.

2. Instruction Type (How is it phrased?): Choose from: Imperative (Command), Interrogative (Question).

3. Visual Impact (How drastic is the change?): Choose from: Subtle Edit, Moderate Edit, Drastic Edit.

4. Object Type (What is being modified?): Choose from: Animals, People, Furniture/Objects, Vehicles, Food, Background/Scene, Other.

Provide the classification as a list of categories, such as ["Object Addition", "Imperative", "Subtle Edit", "Animals"] - nothing else.

"""


def get_user_prompt(instruction):
    return f"Classify the instruction: {instruction}."

In [25]:
client = OpenAI(api_key=API_KEY)


def call_api(img_original, img_edited, instruction):
    response = client.chat.completions.create(
        model=MODEL,  # "gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": get_user_prompt(instruction)},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_original}"},
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_edited}"},
                    },
                ],
            },
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content
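
The reply is expected to be a list literal with four labels, but a language model can also wrap the list in a Markdown code fence or answer in free text. The following is a sketch of a more defensive parser under that assumption; `parse_classification` is a hypothetical helper and not part of the original notebook.

```python
import ast

FENCE = "`" * 3  # Markdown code-fence marker


def parse_classification(reply, expected_len=4):
    """Parse a model reply into a list of labels, or return None on failure."""
    text = reply.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None  # the reply was not a Python literal
    if not isinstance(value, list) or len(value) != expected_len:
        return None
    if not all(isinstance(item, str) for item in value):
        return None
    return value
```

Returning `None` instead of raising lets the caller decide whether to retry the request or leave the row empty.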

In [26]:
def encode_image(image):
    if isinstance(image, Image.Image):
        buffered = BytesIO()
        image.save(buffered, format="PNG")  # or another supported format
        return base64.b64encode(buffered.getvalue()).decode("utf-8")
    else:
        raise ValueError("Input must be a PIL Image.")

In [36]:
# define the regex pattern to extract id and turn from the output filename
pattern = r"(\d+)-output(\d+)"

with open(
    "/home/jovyan/BA/Github/thesis-edit-evaluation/experiments/edit_turns.json"
) as f:
    turns = json.load(f)

dev = pd.read_csv("annotations_mean_3.csv")
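
To illustrate what the pattern extracts, the sketch below applies it to a file name. The name `360871-output1.png` is a hypothetical example built from an id in `annotations_mean_3.csv`; the actual values of the `output` field in `edit_turns.json` may look different. `parse_output_name` is likewise a hypothetical helper.

```python
import re

pattern = r"(\d+)-output(\d+)"


def parse_output_name(name):
    """Return (sample id, turn) from an output file name, or None if no match."""
    match = re.search(pattern, name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_output_name("360871-output1.png"))  # (360871, 1)
print(parse_output_name("360871-input.png"))    # None
```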

In [37]:
dev.head()

| Unnamed: 0 | id | turn | vificlip_score |
|---|---|---|---|
| 0 | 360871 | 1 | 25.03125 |
| 1 | 360871 | 2 | 31.453125 |
| 2 | 147745 | 1 | 15.851562 |
| 3 | 289514 | 1 | 29.015625 |
| 4 | 289514 | 2 | 23.21875 |


In [None]:
# init new columns
dev["semantic_category"] = ""
dev["instruction_type"] = ""
dev["visual_impact"] = ""
dev["object_type"] = ""

path = "/home/jovyan/BA/Github/MagicBrush/vifi_format/videos"

for index, row in dev.iterrows():
    current_id = int(row["id"])
    current_turn = int(row["turn"])

    for entry in turns:
        output = entry["output"]
        match = re.search(pattern, output)

        if match:
            found_id = int(match.group(1))  # get id of sample
            found_turn = int(match.group(2))  # get turn of sample

            if found_id == current_id and found_turn == current_turn:
                instruction = entry["instruction"].lower()

                video_path = os.path.join(path, f"{found_id}_{found_turn}.mp4")
                cap = cv2.VideoCapture(video_path)

                ret1, frame1 = cap.read()
                ret2, frame2 = cap.read()
                cap.release()

                if not (ret1 and ret2 and frame1 is not None and frame2 is not None):
                    print(f"Skipping {video_path} due to insufficient frames.")
                    break

                frame1 = Image.fromarray(cv2.cvtColor(frame1, cv2.COLOR_BGR2RGB))
                frame2 = Image.fromarray(cv2.cvtColor(frame2, cv2.COLOR_BGR2RGB))

                input_image_encoded = encode_image(frame1)
                output_image_encoded = encode_image(frame2)

                response = call_api(
                    input_image_encoded, output_image_encoded, instruction
                )
                # the model is asked to reply with a list literal of four labels
                response = ast.literal_eval(response.strip())

                dev.at[index, "semantic_category"] = response[0]
                dev.at[index, "instruction_type"] = response[1]
                dev.at[index, "visual_impact"] = response[2]
                dev.at[index, "object_type"] = response[3]
                break  # matching entry found, stop scanning the remaining turns
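
The loop above scans the full `turns` list for every row of `dev`. One possible alternative, sketched here under the assumption that each `(id, turn)` pair occurs at most once in `edit_turns.json`, is to build a dictionary once and look up each row directly. `build_instruction_index` is a hypothetical helper; the `output` and `instruction` keys are the ones used in the loop.

```python
import re

PATTERN = re.compile(r"(\d+)-output(\d+)")


def build_instruction_index(turns):
    """Map (sample id, turn) to the lower-cased instruction of that edit turn."""
    index = {}
    for entry in turns:
        match = PATTERN.search(entry["output"])
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            # keep the first entry if a key appears more than once
            index.setdefault(key, entry["instruction"].lower())
    return index
```

With such an index, `index.get((current_id, current_turn))` would replace the inner `for entry in turns` loop, and a result of `None` would mark a row without a matching instruction.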

In [42]:
dev = dev.drop(columns=["vificlip_score"])
dev.to_csv("categories.csv", index=False)