# Evals API: Image Inputs

OpenAI’s Evals API now supports image inputs, a step toward multimodal functionality. You can use it to evaluate your image use cases, see how your LLM integration is performing, and improve it.

In this cookbook, we'll walk through an image example with the Evals API. Specifically, we will evaluate model-generated responses to an image and its corresponding prompt, using **sampling** to generate the model responses and **model grading** (LLM as a Judge) to score those responses against the image and a reference answer.

Based on your use case, you might only need the sampling functionality or the model grader, and you can revise what you pass in during the eval and run creation to fit your needs. 

## Dataset

For this example, we will use the [VibeEval](https://huggingface.co/datasets/RekaAI/VibeEval) dataset that's hosted on Hugging Face. It contains a collection of image, prompt, and reference answer data. First, we load the dataset.

In [12]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [None]:
from datasets import load_dataset

dataset = load_dataset("RekaAI/VibeEval")



We extract the relevant fields and put them in a JSON-like format to pass in as a data source in the Evals API. Input image data can be a web URL or a base64-encoded string. Here, we use the provided web URLs.

In [10]:
evals_data_source = []

# select the first 5 examples in the dataset to use for this cookbook
for example in dataset["test"].select(range(5)):
    evals_data_source.append({
        "item": {
            "media_url": example["media_url"], # image web URL
            "reference": example["reference"], # reference answer
            "prompt": example["prompt"] # prompt
        }
    })

If you print the data source list, each item should be of a similar form to:

```python
{
  "item": {
    "media_url": "https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg",
    "reference": "This appears to be a classic Margherita pizza, which has the following ingredients...",
    "prompt": "What ingredients do I need to make this?"
  }
}
```
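If your images are local files rather than hosted URLs, the base64 option mentioned above can be sketched as follows. This is a minimal sketch that assumes the API accepts a `data:` URL in the `media_url` field; the helper name `to_data_url`, the file name, and the placeholder bytes are hypothetical and not part of the original example.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension; falling back to JPEG is an assumption.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for the contents of a local image file.
# In practice you would read them with open(path, "rb").read().
fake_image_bytes = b"\xff\xd8\xff\xe0"

local_item = {
    "item": {
        "media_url": to_data_url(fake_image_bytes, "pizza.jpg"),
        "reference": "A reference answer",
        "prompt": "What ingredients do I need to make this?",
    }
}
print(local_item["item"]["media_url"][:30])
```

The resulting entry has the same three fields as the URL-based entries, so it can be appended to the same data source list.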

## Evals Structure

Now that we have our data source and task, we will create our evals. For more details, see the [Evals API docs](https://platform.openai.com/docs/evals/overview).


In [11]:
pip install openai

Note: you may need to restart the kernel to use updated packages.


In [22]:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url="https://api.openai.com/v1",
)

Evals have two parts, the "Eval" and the "Run". In the "Eval", we define the expected structure of the data and the testing criteria (grader). Based on the data that we have compiled, our data source config is as follows:

In [23]:
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "media_url": {"type": "string"},
            "reference": {"type": "string"},
            "prompt": {"type": "string"},
        },
        "required": ["media_url", "reference", "prompt"],
    },
    "include_sample_schema": True,  # enables sampling
}
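Before creating the eval, it can help to check locally that every entry in the data source matches the item schema. The sketch below is a hand-rolled check covering only required fields and string types, not a full JSON Schema validator; the function name `find_item_problems` and the sample entries are hypothetical.

```python
# Local copy of the item schema used in the data source config.
item_schema = {
    "type": "object",
    "properties": {
        "media_url": {"type": "string"},
        "reference": {"type": "string"},
        "prompt": {"type": "string"},
    },
    "required": ["media_url", "reference", "prompt"],
}


def find_item_problems(entry: dict, schema: dict) -> list:
    """Return a list of human-readable problems for one data source entry."""
    item = entry.get("item")
    if not isinstance(item, dict):
        return ["entry has no 'item' object"]
    problems = []
    # Every required field must be present.
    for field in schema.get("required", []):
        if field not in item:
            problems.append(f"missing required field: {field}")
    # Fields declared as strings must actually hold strings.
    for field, spec in schema.get("properties", {}).items():
        if field in item and spec.get("type") == "string" and not isinstance(item[field], str):
            problems.append(f"field {field} should be a string")
    return problems


# Hypothetical entries: one well-formed, one with a missing and a mistyped field.
good_entry = {"item": {"media_url": "https://example.com/pizza.jpg",
                       "reference": "A Margherita pizza",
                       "prompt": "What is this?"}}
bad_entry = {"item": {"media_url": None, "prompt": "What is this?"}}

print(find_item_problems(good_entry, item_schema))
print(find_item_problems(bad_entry, item_schema))
```

Running this over `evals_data_source` before creating the run surfaces malformed rows early instead of at run time.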

For our testing criteria, we set up our grader config. In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit [API docs](https://platform.openai.com/docs/api-reference/graders).

Getting both the data and the grader right is key to an effective evaluation, so you will likely want to iteratively refine the prompts for your graders.

**Note**: The image URL field or template needs to be placed in an input image object to be interpreted as an image. Otherwise, it will be interpreted as a text string.

In [24]:
grader_config = {
    "type": "score_model",
    "name": "Score Model Grader",
    "input": [
        {
            "role": "system",
            "content": "You are an expert grader. Judge how well the model response suits the image and prompt as well as matches the meaning of the reference answer. Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0.",
        },
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Prompt: {{ item.prompt }}."},
                {"type": "input_image", "image_url": "{{ item.media_url }}", "detail": "auto"},
                {"type": "input_text", "text": "Reference answer: {{ item.reference }}. Model response: {{ sample.output_text }}."},
            ],
        },
    ],
    "pass_threshold": 0.9,
    "range": [0, 1],
    "model": "o4-mini",  # model for grading; check that the model you use supports image inputs
}
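When refining grader prompts, it can be useful to preview what a template looks like once the `{{ item.* }}` and `{{ sample.* }}` placeholders are filled in. The sketch below is a local approximation only: it handles the simple `{{ namespace.field }}` form used in this cookbook and is not the templating engine the Evals API runs. The function name `render_template` and the sample values are hypothetical.

```python
import re

# Matches placeholders of the form {{ namespace.field }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\.(\w+)\s*\}\}")


def render_template(text: str, namespaces: dict) -> str:
    """Fill {{ namespace.field }} placeholders from nested dicts (local preview only)."""
    def substitute(match):
        namespace, field = match.group(1), match.group(2)
        return str(namespaces[namespace][field])

    return PLACEHOLDER.sub(substitute, text)


# Hypothetical values standing in for one data source item and one sampled response.
namespaces = {
    "item": {
        "prompt": "What ingredients do I need to make this?",
        "reference": "A classic Margherita pizza",
    },
    "sample": {"output_text": "Dough, tomatoes, mozzarella, basil"},
}

preview = render_template(
    "Reference answer: {{ item.reference }}. Model response: {{ sample.output_text }}.",
    namespaces,
)
print(preview)
```

A preview like this makes it easy to spot awkward punctuation or missing context in the grader input before spending tokens on a run.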

Now, we create the eval object.

In [25]:
eval_object = client.evals.create(
        name="Image Grading",
        data_source_config=data_source_config,
        testing_criteria=[grader_config],
    )

To create the run, we pass in the eval object ID and the data source (i.e., the data we compiled earlier), along with the chat message trajectory used for sampling the model response. While we won't dive into it in this cookbook, the Evals API also supports stored completions containing images as a data source.

Here's the sampling message trajectory we'll use for this example.

In [26]:
sampling_messages = [{
    "role": "user",
    "type": "message",
    "content": {
        "type": "input_text",
        "text": "{{ item.prompt }}"
      }
  },
  {
    "role": "user",
    "type": "message",
    "content": {
        "type": "input_image",
        "image_url": "{{ item.media_url }}",
        "detail": "auto"
    }
  }]

In [None]:
eval_run = client.evals.runs.create(
        name="Image Input Eval Run",
        eval_id=eval_object.id,
        data_source={
            "type": "responses", # sample using responses API
            "source": {
                "type": "file_content",
                "content": evals_data_source
            },
            "model": "gpt-4o-mini", # model used to generate the response; check that the model you use supports image inputs
            "input_messages": {
                "type": "template", 
                "template": sampling_messages}
        }
    )

When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results. 

In [62]:
import time

import pandas as pd

while True:
    run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
    if run.status == "completed" or run.status == "failed":  # check if the run is finished
        output_items = client.evals.runs.output_items.list(
            run_id=run.id, eval_id=eval_object.id
        )
        df = pd.DataFrame({
            "prompt": [item.datasource_item["prompt"] for item in output_items],
            "reference": [item.datasource_item["reference"] for item in output_items],
            "model_response": [item.sample.output[0].content for item in output_items],
            "grading_results": [item.results[0]["sample"]["output"][0]["content"]
                                for item in output_items]
        })
        display(df)
        break
    time.sleep(5)  # wait before polling again

| | prompt | reference | model_response | grading_results |
|---|---|---|---|---|
| 0 | Please provide latex code to replicate this table | Below is the latex code for your table:\n```te... | Here is the LaTeX code to replicate the table ... | {"steps":[{"description":"Check if the model’s... |
| 1 | What ingredients do I need to make this? | This appears to be a classic Margherita pizza,... | To make a classic Margherita pizza like the on... | {"steps":[{"description":"Compare the model re... |
| 2 | Is this safe for a vegan to eat? | Based on the image, this dish appears to be a ... | To determine if this dish is safe for a vegan ... | {"steps":[{"description":"The reference answer... |
| 3 | Where was this taken? | This image is of the seafront in San Sebastián... | I can't determine the exact location of the im... | {"steps":[{"description":"Compare model respon... |
| 4 | What is the man in the picture doing? | The man on the postcard is playing bagpipes, w... | The man in the picture is playing the bagpipes... | {"steps":[{"description":"Compare the model re... |
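The polling loop above waits indefinitely if the run never reaches a finished state. One way to bound the wait is a small helper with a timeout, sketched below. The helper name `wait_for_run`, its parameters, and the inclusion of `"canceled"` as a terminal status are assumptions for illustration; the demo uses a fake status sequence so it runs without an API call.

```python
import time

# Statuses treated as finished; "canceled" is an assumption beyond the two checked above.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_run(fetch_status, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a fake status sequence and a no-op sleep, so nothing is actually awaited.
fake_statuses = iter(["queued", "in_progress", "completed"])
final_status = wait_for_run(lambda: next(fake_statuses), interval_s=0, sleep=lambda _: None)
print(final_status)

# With the real client from this cookbook, fetch_status could be:
# lambda: client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id).status
```

Passing the status lookup in as a function keeps the helper independent of the client, which also makes it easy to test.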


To see the full output item, such as for the pizza ingredients image, we can do the following. The structure of the output item is specified in the API docs [here](https://platform.openai.com/docs/api-reference/evals/run-output-item-object).

In [68]:
import json

pizza_item = next(
    item for item in output_items 
    if "What ingredients do I need to make this?" in item.datasource_item["prompt"]
)

print(json.dumps(dict(pizza_item), indent=2, default=str))

{
  "id": "outputitem_68768c0f7658819187d4f128c2e0ff8c",
  "created_at": 1752599567,
  "datasource_item": {
    "prompt": "What ingredients do I need to make this?",
    "media_url": "https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg",
    "reference": "This appears to be a classic Margherita pizza, which has the following ingredients:\n\n- Pizza Dough: You'll need yeast, flour, salt, and water to make the dough. A simple recipe is 500g of flour, 1 tsp of salt, 1 tbsp of sugar, and about 300ml warm water.\n\n- Tomatoes: Fresh or canned San Marzano tomatoes are traditionally used for their sweet flavor. If using fresh tomatoes, you can blend them into a sauce.\n\n- Mozzarella Cheese: Traditionally mozzarella di bufala campana D.O.P., but Fior di Latte or other fresh mozzarella work well too.\n\n- Basil Leaves: Fresh basil leaves add a burst of flavor.\n\n- Olive Oil: Extra virgin olive oil is drizzled over the pizza before ba
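The `grading_results` column holds the grader's raw output as a JSON string. To work with scores programmatically, you could parse that string, as in the sketch below. The output above only shows a `steps` list with `description` entries; the `result` and `conclusion` fields, the sample string, and the rule that a score at or above `pass_threshold` passes are assumptions, so check them against your own output items.

```python
import json


def summarize_grade(grader_output: str, pass_threshold: float = 0.9) -> dict:
    """Parse a grader's JSON output string and compare its score to the threshold."""
    parsed = json.loads(grader_output)
    score = float(parsed["result"])  # assumed field name for the numeric score
    return {
        "score": score,
        "passed": score >= pass_threshold,  # assumed pass rule
        "num_steps": len(parsed.get("steps", [])),
    }


# Hypothetical grader output, shaped like the truncated strings in the results table.
hypothetical_output = json.dumps({
    "steps": [
        {"description": "Compare the model response with the reference answer.",
         "conclusion": "The response covers most of the listed ingredients."}
    ],
    "result": 0.5,
})

summary = summarize_grade(hypothetical_output)
print(summary)
```

Applied across the DataFrame column, a helper like this gives a numeric score series that is easier to aggregate than the raw strings.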

Now, feel free to extend this to your own use cases! Some examples include grading image generation results with the Evals API model graders and evaluating your OCR use cases with model sampling.