# GPT-4o-as-a-judge

Initial experiments using GPT-4o for data labeling. Ultimately not included in the thesis report.

See the OpenAI guide on [prompting with visual inputs](https://platform.openai.com/docs/guides/vision).

In [6]:
!pip install opencv-python -q
!pip install torchvision -q
!pip install "torchmetrics[image]" -q
!pip install torch-fidelity -q
!pip install numpy -q
!pip install torch -q
!pip install openai -q
!pip install python-dotenv -q

In [23]:
import json
import pandas as pd
import re
import numpy as np
import cv2
import torch
import base64
import requests
import openai
import os

from PIL import Image
from torchvision import transforms
from torchmetrics.image.fid import FrechetInceptionDistance
from io import BytesIO
from openai import OpenAI
from dotenv import load_dotenv
from tqdm import tqdm

In [10]:
load_dotenv()
#load_dotenv(dotenv_path="/home/jovyan/BA/Github/thesis-edit-evaluation/.env")

MODEL = "gpt-4o-2024-08-06"
API_KEY = os.getenv("OPENAI_API_KEY")

Define the evaluation prompts, modeled on the user study.

In [11]:
system_prompt = """
You want to change the content of a specific area within an image. This technique is called text-guided image editing. 

It involves the following elements:
- Original Image: The starting image before any edits.
- Prompt: A text description that specifies the desired change in the original image.
- Mask: A defined area in the image where the change should occur according to the prompt.
- Edited Image: The final image after the desired change has been made.

Your task is to evaluate the quality of these edits by considering three different aspects. 
Each aspect should be rated on a scale from 1 to 10, where 1 indicates "very poor" and 10 means "excellent."

Aspects to Evaluate:
1. Prompt-Image Alignment: 
    - Objective: Assess how well the edited area aligns with the instructions provided in the text prompt.
    - Considerations: Verify if the desired changes are accurately implemented. Pay attention to details such as numbers, colors, and objects mentioned in the prompt.
2. Visual Quality: 
    - Objective: Evaluate the visual appeal of the edited area within the mask, focusing solely on the appearance of the new content within the masked area.
    - Considerations: Assess realism and aesthetics, including color accuracy and overall visual coherence.
3. Consistency Between Original Image and Edited Area: 
    - Objective: Measure how well the edit integrates with the original image.
    - Considerations: Examine consistency in style, lighting, logic, and spatial coherence between the edited area and the original image.
4. Overall Rating:
    - After evaluating each aspect individually, provide an overall rating of the entire edited image. Consider how you perceive and like the edit as a whole, how well it meets your expectations and integrates with the original image. 


Input and Output:
- Input: The evaluation will be based on the following items:
    - Original image
    - Text prompt
    - Image with a masked area
    - Edited image
- Output: Provide your ratings in the following JSON format. Fill each key with a numerical value.
{
  "alignment": "",
  "visual_quality": "",
  "consistency": "",
  "overall": ""
}

Additional Instructions:
- Careful justification: Think carefully about your ratings. Avoid providing ratings without thoughtful consideration.
- Output: Do not include anything other than the JSON file in your response.

"""

def get_user_prompt(instruction):
    return f"Evaluate the quality given the following prompt: {instruction}."

In addition to the prompt, the images are passed as input.

In [12]:
client = OpenAI(api_key=API_KEY)

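def call_api(prompt, img_original, img_mask, img_edited, instruction):
    # Note: `prompt` is currently unused; the system message is the global `system_prompt`.
    # The three images are expected as base64-encoded strings.
    response = client.chat.completions.create(
        model=MODEL,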
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text", 
                        "text": get_user_prompt(instruction)
                    },
                    # The images are encoded as PNG, so every data URL declares image/png.
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{img_original}"
                        }
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{img_mask}"
                        }
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{img_edited}"
                        }
                    },
                ]
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content
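
Calls to a remote API can fail transiently (rate limits, timeouts), which would abort a long labeling run. The sketch below shows one way to retry such a call with exponential backoff. `with_retries` is a hypothetical helper that is not part of the original notebook, and the retry count and delays are assumed values.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, doubling the delay each time.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Delays are base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In the loop further down, the request could then be wrapped as `with_retries(lambda: call_api(...))`. Catching every exception is a simplification; a real run would probably retry only on rate-limit and connection errors.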

In [13]:
columns = ['model', 'id', 'turn', 'alignment', 'visual_quality', 'consistency', 'overall']
df_gpt = pd.DataFrame(columns=columns)

In [14]:
def encode_image(image):
    if isinstance(image, Image.Image):
        buffered = BytesIO()
        image.save(buffered, format="PNG")  # or another supported format
        return base64.b64encode(buffered.getvalue()).decode('utf-8')
    else:
        raise ValueError("Input must be a PIL Image.")
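
Since `encode_image` writes PNG bytes, the MIME type declared in the data URL should also be PNG. A minimal sketch of a helper that keeps the two in sync is shown below. `to_data_url` and `MIME_BY_FORMAT` are hypothetical names introduced here for illustration.

```python
import base64

# Assumed mapping from PIL format names to MIME types (illustrative subset).
MIME_BY_FORMAT = {"PNG": "image/png", "JPEG": "image/jpeg"}


def to_data_url(b64_payload, image_format="PNG"):
    """Wrap an already base64-encoded image in a data URL with a matching MIME type."""
    mime = MIME_BY_FORMAT[image_format.upper()]
    return f"data:{mime};base64,{b64_payload}"


# Usage with stand-in bytes; in the notebook the payload would come from encode_image().
payload = base64.b64encode(b"\x89PNG\r\n").decode("utf-8")
url = to_data_url(payload, "PNG")
```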

GPT-4o judges every sample in the dev split.

In [20]:
dev_split = pd.read_csv("/home/jovyan/BA/Github/MagicBrush/dev_data_with_mask.csv")

In [27]:
pattern = r'(\d+)-output(\d+)'
target_size = (512,512)

for index, row in tqdm(dev_split.iterrows(), total=len(dev_split), desc="Progress"):
    id = row["img_id"]
    turn = row["turn_index"]
    instruction = row["instruction"]

    # open images, resize to 512x512, and convert to RGB
    output_image = Image.open(row["target_img"]).resize(target_size).convert('RGB')
    input_image = Image.open(row["source_img"]).resize(target_size).convert('RGB')
    mask_image = Image.open(row["mask_img"]).resize(target_size).convert('RGB')

    output_array = np.array(output_image)
    mask_array = np.array(mask_image)
    masked_area = cv2.absdiff(output_array, mask_array)

    # Convert the uint8 array straight to a PIL image; a tensor round trip is not needed
    masked_area_image = Image.fromarray(masked_area)

    input_image_encoded = encode_image(input_image)
    masked_area_encoded = encode_image(masked_area_image)
    output_image_encoded = encode_image(output_image)

    # The first argument is unused by call_api; pass the prompt rather than the API key
    response = call_api(
        system_prompt,
        input_image_encoded,
        masked_area_encoded,
        output_image_encoded,
        instruction
    )

    response = json.loads(response)
    print(response)

    new_row = pd.DataFrame({
        "model": [MODEL],
        "id": [id],
        "turn": [turn], 
        "alignment": [response.get("alignment", None)],
        "visual_quality": [response.get("visual_quality", None)],
        "consistency": [response.get("consistency", None)], 
        "overall": [response.get("overall", None)]
    })

    df_gpt = pd.concat([df_gpt, new_row], ignore_index=True)

Progress: 1it [00:03,  3.84s/it]

{'alignment': '8', 'visual_quality': '8', 'consistency': '9', 'overall': '8'}
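
The sample output above shows the ratings coming back as strings, and `json.loads` will raise if the model wraps its reply in a code fence or adds extra text. The following sketch shows one possible way to make parsing more tolerant and to convert the ratings to integers. `parse_scores` and `SCORE_KEYS` are hypothetical names that are not part of the original notebook.

```python
import json
import re

SCORE_KEYS = ("alignment", "visual_quality", "consistency", "overall")


def parse_scores(raw):
    """Extract the four ratings from a model reply as integers in 1..10."""
    # Keep the span from the first '{' to the last '}' so that surrounding
    # code fences or stray prose are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    scores = {}
    for key in SCORE_KEYS:
        value = int(data[key])  # accepts 8 as well as "8"
        if not 1 <= value <= 10:
            raise ValueError(f"{key} out of range: {value}")
        scores[key] = value
    return scores
```

A missing key raises `KeyError` and a non-numeric rating raises `ValueError`, so malformed replies fail loudly instead of ending up as empty cells in `df_gpt`.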





In [102]:
df_gpt.to_csv("gpt_scores.csv")