# AI Art Gallery Tour Guide

## Project Description
This project implements an intelligent art gallery tour guide system using Large Language Models (LLMs) and computer vision capabilities. The system generates detailed, contextual explanations for artworks by combining structured metadata with visual analysis.

### Key Components

1. **Baseline Implementation**
   - Developed a prompt engineering framework for artwork description generation using the Gemini API
   - Created structured prompts incorporating artwork metadata (artist info, historical context, physical details)
   - Implemented a comprehensive evaluation framework with structured ratings (1-5 scale)

2. **Generation Control & Optimization**
   - Implemented length control (200 tokens) for more concise explanations
   - Analyzed the relationship between response length and quality ratings
   - Monitored and evaluated generation quality across different length constraints

3. **Multimodal Integration**
   - Enhanced the system with image processing capabilities using Gemini's multimodal features
   - Combined textual metadata with visual analysis for richer artwork descriptions
   - Developed a specialized prompt template that balances textual and visual information

4. **Evaluation Framework**
   - Created a sophisticated evaluation system assessing multiple criteria:
     - Instruction following
     - Factual accuracy
     - Artwork focus/relevance
     - Analytical depth
     - Clarity and cohesion
     - Contextualization
   - Implemented structured output for systematic quality assessment
   - Analyzed bad generations to identify and address common issues


# Part 0: Environment Setup

In [None]:
# # Install all required packages
# !pip install python-dotenv pandas google-generativeai requests tqdm aiohttp

# # Restart kernel after installation to use newly installed packages
# from IPython.core.display import HTML
# HTML("<script>Jupyter.notebook.kernel.restart()</script>")

The first step is to create a '.env' file in your project folder and add the following line:

GEMINI_API_KEY=your_api_key_here


In [1]:
# Import all required packages and read gemini api key from .env file
# Environment and API setup
import os
from dotenv import load_dotenv

# Data processing and analysis
import pandas as pd
import json
from datetime import datetime

# Google Gemini API
from google import genai
from google.genai import types

# HTTP requests and image handling
import requests

# Utilities
import time
import random
import enum
from tqdm import tqdm  # Progress tracking

# Load environment variables
load_dotenv()

# Initialize Gemini API
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    raise ValueError("GEMINI_API_KEY not found in environment variables")

# Configure pandas display options
pd.set_option('display.max_colwidth', None)  # Show full column content

# Part 1: Data Processing

## Data Loading & EDA

The data is provided by The Metropolitan Museum of Art:

https://github.com/metmuseum/openaccess?tab=readme-ov-file

We select the artworks that are in the public domain (Is Public Domain = True) and have images available.

We add the primary image URL for each selected artwork through API calls (https://metmuseum.github.io/).

The code for this process is in fetch_met_images.py, which is not shown in this notebook. The resulting data file is stored in data/met_with_images.csv, which is provided in the data folder.
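
Since fetch_met_images.py is not shown, here is a minimal sketch of what such a lookup might look like. It assumes the object endpoint of the Met Collection API and its `primaryImage` field. The helper names (`object_url`, `extract_primary_image`, `fetch_primary_image`) are hypothetical and are not taken from the actual script. The sketch uses only the standard library.

```python
import json
from urllib.request import urlopen

# Object endpoint of the Met Collection API (assumed to be what the script queries).
MET_OBJECT_ENDPOINT = "https://collectionapi.metmuseum.org/public/collection/v1/objects"

def object_url(object_id):
    """Build the request URL for a single object ID."""
    return f"{MET_OBJECT_ENDPOINT}/{int(object_id)}"

def extract_primary_image(payload):
    """Return the primary image URL from an object record, or None if missing or empty."""
    return payload.get("primaryImage") or None

def fetch_primary_image(object_id, timeout=10):
    """Query the API for one object and return its primary image URL (network call)."""
    with urlopen(object_url(object_id), timeout=timeout) as resp:
        return extract_primary_image(json.load(resp))
```

Keeping the URL construction and the field extraction separate from the network call makes the parsing logic easy to check without hitting the API.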

In [2]:
# read Met data
file_path = '/Users/yas063/Desktop/LLM/Google_Kaggle_GenAI_Bootcamp/Capstone Project/data/met_with_images.csv'

df = pd.read_csv(file_path)
df.head()



Unnamed: 0,Object Number,Is Highlight,Is Timeline Work,Is Public Domain,Object ID,Gallery Number,Department,AccessionYear,Object Name,Title,...,Classification,Rights and Reproduction,Link Resource,Object Wikidata URL,Metadata Date,Repository,Tags,Tags AAT URL,Tags Wikidata URL,Primary Image URL
0,1970.289.6,False,False,True,34,774.0,The American Wing,1970.0,Clock,Acorn Clock,...,,,http://www.metmuseum.org/art/collection/search/34,https://www.wikidata.org/wiki/Q116373732,,"Metropolitan Museum of Art, New York, NY",Landscapes|Boats,http://vocab.getty.edu/page/aat/300132294|http://vocab.getty.edu/page/aat/300178749,https://www.wikidata.org/wiki/Q191163|https://www.wikidata.org/wiki/Q35872,https://images.metmuseum.org/CRDImages/ad/original/204788.jpg
1,38.165.51,False,False,True,37,774.0,The American Wing,1938.0,Figure,Figure of Admiral George Rodney,...,,,http://www.metmuseum.org/art/collection/search/37,https://www.wikidata.org/wiki/Q116373729,,"Metropolitan Museum of Art, New York, NY",Cannons|Swords|Men,http://vocab.getty.edu/page/aat/300036936|http://vocab.getty.edu/page/aat/300037048|http://vocab.getty.edu/page/aat/300025928,https://www.wikidata.org/wiki/Q81103|https://www.wikidata.org/wiki/Q12791|https://www.wikidata.org/wiki/Q8441,https://images.metmuseum.org/CRDImages/ad/original/DP247752.jpg
2,38.165.50,False,False,True,38,774.0,The American Wing,1938.0,Figure,Figure of Admiral Samuel Hood,...,,,http://www.metmuseum.org/art/collection/search/38,https://www.wikidata.org/wiki/Q116373728,,"Metropolitan Museum of Art, New York, NY",Cannons|Swords|Men|Admirals,http://vocab.getty.edu/page/aat/300036936|http://vocab.getty.edu/page/aat/300037048|http://vocab.getty.edu/page/aat/300025928|http://vocab.getty.edu/page/aat/300236014,https://www.wikidata.org/wiki/Q81103|https://www.wikidata.org/wiki/Q12791|https://www.wikidata.org/wiki/Q8441|https://www.wikidata.org/wiki/Q132851,https://images.metmuseum.org/CRDImages/ad/original/DP247753.jpg
3,18.11.10,False,False,True,39,,The American Wing,1918.0,Advertisement,Advertisement for Norwich Stone Ware Factory,...,,,http://www.metmuseum.org/art/collection/search/39,,,"Metropolitan Museum of Art, New York, NY",Advertisements,http://vocab.getty.edu/page/aat/300193993,https://www.wikidata.org/wiki/Q39911916,https://images.metmuseum.org/CRDImages/ad/original/37808.jpg
4,46.140.143,False,False,True,40,774.0,The American Wing,1946.0,Ale glass,Ale Glass,...,,,http://www.metmuseum.org/art/collection/search/40,https://www.wikidata.org/wiki/Q116373727,,"Metropolitan Museum of Art, New York, NY",,,,https://images.metmuseum.org/CRDImages/ad/original/174118.jpg


In [3]:
# print the shape of the dataframe
print('df shape:', df.shape)

df shape: (248472, 55)


In [6]:
# keep only the rows with a non-empty image URL
df = df[df['Primary Image URL'].notna()]
# print the shape of the dataframe
print('df shape:', df.shape)

df shape: (247565, 55)


In [7]:
# print column names, there are 55 feature columns
print("Column names in the dataset:")
for i, col in enumerate(df.columns):
    print(f"{i+1}. {col}")

Column names in the dataset:
1. Object Number
2. Is Highlight
3. Is Timeline Work
4. Is Public Domain
5. Object ID
6. Gallery Number
7. Department
8. AccessionYear
9. Object Name
10. Title
11. Culture
12. Period
13. Dynasty
14. Reign
15. Portfolio
16. Constituent ID
17. Artist Role
18. Artist Prefix
19. Artist Display Name
20. Artist Display Bio
21. Artist Suffix
22. Artist Alpha Sort
23. Artist Nationality
24. Artist Begin Date
25. Artist End Date
26. Artist Gender
27. Artist ULAN URL
28. Artist Wikidata URL
29. Object Date
30. Object Begin Date
31. Object End Date
32. Medium
33. Dimensions
34. Credit Line
35. Geography Type
36. City
37. State
38. County
39. Country
40. Region
41. Subregion
42. Locale
43. Locus
44. Excavation
45. River
46. Classification
47. Rights and Reproduction
48. Link Resource
49. Object Wikidata URL
50. Metadata Date
51. Repository
52. Tags
53. Tags AAT URL
54. Tags Wikidata URL
55. Primary Image URL


In [8]:
# distribution of department: the majority are Drawings and Prints
print(df['Department'].value_counts())

Department
Drawings and Prints                          65379
European Sculpture and Decorative Arts       33783
Asian Art                                    30709
Greek and Roman Art                          29819
Islamic Art                                  13220
Egyptian Art                                 12181
The American Wing                            11792
Costume Institute                             8298
Arms and Armor                                7075
Medieval Art                                  6917
Photographs                                   6415
Arts of Africa, Oceania, and the Americas     6318
Ancient Near Eastern Art                      6177
European Paintings                            2322
Robert Lehman Collection                      2272
The Cloisters                                 2269
Musical Instruments                           2269
Modern and Contemporary Art                    203
The Libraries                                  147
Name: count, dtype: 

For demonstration purposes, we will use a sample of 50 artworks from the Drawings and Prints Department.

In [9]:
# sample 50 data points from Drawings and Prints department
sample_df = df[df['Department'] == 'Drawings and Prints'].sample(50)

# save the sample dataframe to a csv file
sample_df.to_csv('data/sample50_df_drawings_and_prints.csv', index=False)
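
Note that the sampling call above is not seeded, so re-running the cell draws a different set of 50 artworks. A minimal sketch of a repeatable draw is shown below. The toy dataframe, the helper name `sample_department`, and the seed value 42 are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the full Met dataframe (values are made up for illustration).
toy_df = pd.DataFrame({
    "Department": ["Drawings and Prints"] * 8 + ["Asian Art"] * 4,
    "Object ID": range(12),
})

def sample_department(df, department, n, seed=42):
    """Draw n rows from one department; a fixed seed makes the draw repeatable."""
    subset = df[df["Department"] == department]
    return subset.sample(n=n, random_state=seed)

first = sample_department(toy_df, "Drawings and Prints", 3)
second = sample_department(toy_df, "Drawings and Prints", 3)
# Both draws select the same rows because the seed is fixed.
```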

## Data Cleaning and Feature Selection

In [17]:
# Load sampled data
file_path = 'data/sample50_df_drawings_and_prints.csv'

df = pd.read_csv(file_path)
print('df shape:', df.shape)

df shape: (50, 55)


In [20]:
# Drop feature columns with more than 70% null values
# (a column is kept only if at least 30% of its values are non-null).
# The number of features is reduced from 55 to 35.

df_cleaned = df.dropna(thresh=0.3 * len(df), axis=1)
df_cleaned.info()
# print the shape of the dataframe
print('df shape:', df_cleaned.shape)
# print the column names
print("Remaining column names in the dataset:")
print(df_cleaned.columns.tolist())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Object Number        50 non-null     object 
 1   Is Highlight         50 non-null     bool   
 2   Is Timeline Work     50 non-null     bool   
 3   Is Public Domain     50 non-null     bool   
 4   Object ID            50 non-null     int64  
 5   Department           50 non-null     object 
 6   AccessionYear        50 non-null     float64
 7   Object Name          50 non-null     object 
 8   Title                50 non-null     object 
 9   Constituent ID       49 non-null     object 
 10  Artist Role          49 non-null     object 
 11  Artist Prefix        49 non-null     object 
 12  Artist Display Name  49 non-null     object 
 13  Artist Display Bio   49 non-null     object 
 14  Artist Suffix        49 non-null     object 
 15  Artist Alpha Sort    49 non-null     objec
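
The `thresh` argument of `dropna` is easy to misread, because it counts the required non-null values rather than the nulls. The sketch below expresses the same 70% rule through the null fraction of each column, which is intended to be equivalent. The toy dataframe and the helper name `drop_sparse_columns` are assumptions for illustration.

```python
import pandas as pd

# Toy data: 10 rows with different amounts of missing values per column.
toy = pd.DataFrame({
    "Title": list("abcdefghij"),            # 0% null
    "Culture": [None] * 8 + ["x", "y"],     # 80% null
    "Medium": [None] * 6 + list("mnop"),    # 60% null
})

def drop_sparse_columns(df, max_null_fraction=0.7):
    """Keep only the columns whose share of null values is at most max_null_fraction."""
    null_fraction = df.isna().mean()
    return df.loc[:, null_fraction <= max_null_fraction]

kept = drop_sparse_columns(toy)
# Same rule written with thresh: at least 3 of the 10 values must be non-null.
kept_thresh = toy.dropna(axis=1, thresh=3)
```

Here "Culture" is dropped by both variants, while "Title" and "Medium" survive.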

In [21]:
# Have a look at the content of each feature column by reading the first row of the dataframe and printing it as JSON
artwork_data = df_cleaned.iloc[0].to_json()
artwork_data = json.loads(artwork_data)
artwork_data

{'Object Number': '32.62.19',
 'Is Highlight': False,
 'Is Timeline Work': False,
 'Is Public Domain': True,
 'Object ID': 333979,
 'Department': 'Drawings and Prints',
 'AccessionYear': 1932.0,
 'Object Name': 'Print',
 'Title': 'Plate 11 from "The Disasters of War" (Los Desastres de la Guerra): \'Neither do these\' (Ni por esas)',
 'Constituent ID': '16195',
 'Artist Role': 'Artist',
 'Artist Prefix': ' ',
 'Artist Display Name': 'Goya (Francisco de Goya y Lucientes)',
 'Artist Display Bio': 'Spanish, Fuendetodos 1746–1828 Bordeaux',
 'Artist Suffix': ' ',
 'Artist Alpha Sort': 'Goya (Francisco de Goya y Lucientes)',
 'Artist Nationality': 'Spanish',
 'Artist Begin Date': '1746      ',
 'Artist End Date': '1828      ',
 'Artist Gender': None,
 'Artist ULAN URL': 'http://vocab.getty.edu/page/ulan/500118936',
 'Artist Wikidata URL': 'https://www.wikidata.org/wiki/Q5432',
 'Object Date': 'ca. 1810',
 'Object Begin Date': 1805,
 'Object End Date': 1815,
 'Medium': 'Etching, drypoint, bur

**Observations:**

Analyzing the data structure, we can categorize the features into these groups:

1. Artwork-Related Features
- title, object name, date, medium, dimensions
- These features are essential for generating descriptive prompts
- Provide context about the physical attributes and historical period

2. Artist-Related Features
- name, biography, nationality, lifespan
- Critical for generating contextual prompts
- Enable historical and cultural context in descriptions

3. System Features and References
- object IDs, accession numbers, URLs
- Limited utility for text generation
- Valuable for:
  - RAG system queries
  - External resource linking
  - Image retrieval for supplementary visual analysis

**Using AI to Identify Informative Features**

Let's leverage Gemini-2.5-Pro to analyze our dataset and determine the most relevant columns for artwork description generation.

Below is what Gemini thinks are the most important columns for our task:

In [22]:
# --- Column Selection ---

# Define Tier 1 columns (Essential)
# Note: Included individual artist date/nationality fields as they are part of the bio context
tier1_cols = [
    'Title',
    'Artist Display Name',
    'Artist Display Bio',
    'Artist Nationality', # Component of Bio
    'Artist Begin Date',  # Component of Bio
    'Artist End Date',    # Component of Bio
    'Object Date',
    'Object Begin Date', # Included as it specifies the date range start
    'Object End Date',   # Included as it specifies the date range end
    'Medium',
    'Tags',
    'Primary Image URL'
]

# Define Tier 2 columns (Highly Useful)
tier2_cols = [
    'Dimensions',
    'Department',
    'Object Name',
    'Classification',
    'Repository',
    'Tags AAT URL',
    'Tags Wikidata URL',
    'Artist Role'
]

# Define ID columns
id_cols = [
    'Object ID',
    'Object Number', # Often used as a primary museum identifier
    'Constituent ID'
]

# Combine all desired columns into a single list
# Using set to automatically handle potential duplicates if any column was listed twice
all_desired_cols = list(set(id_cols + tier1_cols + tier2_cols))

In [23]:
# We take Gemini's suggestion and keep only the 23 desired columns
df_cleaned_selected_gemini = df_cleaned[all_desired_cols]
# print the shape of the dataframe
print('df_cleaned_selected shape:', df_cleaned_selected_gemini.shape)

df_cleaned_selected shape: (50, 23)
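
One side effect of `list(set(...))` is that the resulting column order is arbitrary and can change between runs. If a stable order matters (for example, to keep prompts identical across sessions), an order-preserving merge is a small change. The helper name `ordered_unique` is hypothetical.

```python
def ordered_unique(*column_groups):
    """Concatenate column lists, dropping repeats but keeping first-seen order."""
    merged = [col for group in column_groups for col in group]
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(merged))

example = ordered_unique(
    ["Object ID", "Object Number"],
    ["Title", "Medium"],
    ["Medium", "Dimensions"],
)
```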


In [24]:
# save the dataframe after cleaning and feature selection into a csv file
df_cleaned_selected_gemini.to_csv('data/sample50_drawings_and_prints_cleaned.csv', index=False)

# Part 2: Baseline Generation

For the baseline implementation, we perform these steps:
- Select a few basic features, excluding IDs and URLs.
- Create a structured prompt template.
- Use the Gemini-2.0-Flash API to generate detailed artwork explanations.

(In a later optimization step, we will add the image URL to the generation input.)

In [25]:
# load the selected dataframe
df_selected = pd.read_csv('data/sample50_drawings_and_prints_cleaned.csv')

In [26]:
# print the column names
print(df_selected.columns)

Index(['Object Date', 'Department', 'Dimensions', 'Object End Date', 'Medium',
       'Object Number', 'Object Name', 'Title', 'Primary Image URL',
       'Classification', 'Constituent ID', 'Repository', 'Tags AAT URL',
       'Artist Display Bio', 'Tags', 'Artist End Date', 'Object Begin Date',
       'Artist Display Name', 'Artist Role', 'Object ID', 'Artist Begin Date',
       'Tags Wikidata URL', 'Artist Nationality'],
      dtype='object')


Select a few basic features, excluding IDs and URLs.

In [27]:
# For the baseline, we do not include IDs and URLs in the prompt
url_cols = ['Primary Image URL','Tags AAT URL','Tags Wikidata URL']
id_cols = ['Object ID','Object Number','Constituent ID']

baseline_cols = list(set(df_selected.columns) - set(url_cols) - set(id_cols))
baseline_cols

['Artist End Date',
 'Object Begin Date',
 'Department',
 'Object Date',
 'Artist Display Name',
 'Artist Role',
 'Dimensions',
 'Medium',
 'Object End Date',
 'Object Name',
 'Title',
 'Artist Begin Date',
 'Classification',
 'Repository',
 'Artist Display Bio',
 'Artist Nationality',
 'Tags']

Create a structured prompt template.

In [28]:
# --- Prompt Generation ---

# Define the prompt template using str.format placeholder syntax
# The {details} placeholder is filled with one bullet per column in 'baseline_cols'
prompt_template = """
You are a helpful tour guide in an art gallery. 
Please provide an explanation and insights for the following artwork to the visitor, using the details provided:

**Artwork Details:**
{details}

**Instructions:**
Based on these details, please generate a concise explanation. Focus on:
1.  The artist's background and context (using bio, nationality, dates).
2.  The artwork's subject matter (using title and tags).
3.  The historical period and significance (using object dates).
4.  The materials and techniques used (using medium and classification).
5.  Mention its size (dimensions) and where it is located (repository).
"""

# List to store the generated prompts
generated_prompts = []

print("\n--- Generating Prompts ---")

# Iterate through each row of the extracted DataFrame
for index, row in df_selected.iterrows():
    # Create the details string for the current artwork
    details_str = ""
    for col_name in baseline_cols:  # Iterate only through the baseline columns
        # Ensure the value is converted to string, handle potential None/NaN values
        value = str(row[col_name]) if pd.notna(row[col_name]) else "N/A"
        details_str += f"* **{col_name}:** {value}\n"

    # Format the full prompt using the template and the generated details string
    full_prompt = prompt_template.format(details=details_str.strip())
    generated_prompts.append(full_prompt)


--- Generating Prompts ---


In [29]:
# --- Display Example Prompt ---
if generated_prompts:
    print("\nExample of a generated prompt (for the first artwork):")
    print(generated_prompts[0])
    print(f"\nTotal prompts generated: {len(generated_prompts)}")
else:
    print("\nNo prompts were generated (DataFrame might be empty or columns missing).")


Example of a generated prompt (for the first artwork):

You are a helpful tour guide in an art gallery. 
Please provide an explanation and insights for the following artwork to the visitor, using the details provided:

**Artwork Details:**
* **Artist End Date:** 1828      
* **Object Begin Date:** 1805
* **Department:** Drawings and Prints
* **Object Date:** ca. 1810
* **Artist Display Name:** Goya (Francisco de Goya y Lucientes)
* **Artist Role:** Artist
* **Dimensions:** Plate: 6 5/16 × 8 3/8 in. (16.1 × 21.2 cm)
Sheet: 8 3/8 × 12 3/4 in. (21.3 × 32.4 cm)
* **Medium:** Etching, drypoint, burin (working proof)
* **Object End Date:** 1815
* **Object Name:** Print
* **Title:** Plate 11 from "The Disasters of War" (Los Desastres de la Guerra): 'Neither do these' (Ni por esas)
* **Artist Begin Date:** 1746      
* **Classification:** Prints
* **Repository:** Metropolitan Museum of Art, New York, NY
* **Artist Display Bio:** Spanish, Fuendetodos 1746–1828 Bordeaux
* **Artist Nationality:*
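
The example prompt shows two small formatting issues in the raw values: trailing whitespace (as in the artist dates) and a line break inside Dimensions that splits one bullet across two lines. A minimal sketch of a cleanup step is given below. It works on a plain dict rather than a dataframe row, and the helper names `is_missing` and `format_details` are hypothetical.

```python
import math

def is_missing(value):
    """True for None, NaN, or strings that are empty after stripping."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def format_details(record, columns):
    """Render one artwork record (a dict) as a Markdown bullet list."""
    lines = []
    for col in columns:
        value = record.get(col)
        if is_missing(value):
            text = "N/A"
        else:
            # Collapse newlines and repeated spaces so each field stays on one line.
            text = " ".join(str(value).split())
        lines.append(f"* **{col}:** {text}")
    return "\n".join(lines)

record = {
    "Artist Begin Date": "1746      ",
    "Artist Gender": None,
    "Dimensions": "Plate: 6 in.\nSheet: 8 in.",
}
details = format_details(record, ["Artist Begin Date", "Artist Gender", "Dimensions"])
```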

Use Gemini-2.0-Flash API to generate detailed artwork explanations.

In [30]:
# Generate the explanations, wait for 1 minute after every 15 calls because of the rate limit
import time
from google import genai

client = genai.Client(api_key=GEMINI_API_KEY)

responses = []
for i, prompt in enumerate(generated_prompts):
    try:
        # Add 1 minute wait after every 15 calls
        if i > 0 and i % 15 == 0:
            print(f"\nWaiting for 1 minute after {i} calls...")
            time.sleep(60)  # Wait for 60 seconds
            print("Resuming generation...")
            
        response = client.models.generate_content(
            model="gemini-2.0-flash", contents=prompt
        )
        responses.append(response.text)
        print(f"Generated explanation for artwork {i + 1}/{len(generated_prompts)}")
        
    except Exception as e:
        print(f"Error generating explanation for artwork {i + 1}: {e}")
        responses.append(f"Error: {e}")

Generated explanation for artwork 1/50
Generated explanation for artwork 2/50
Generated explanation for artwork 3/50
Generated explanation for artwork 4/50
Generated explanation for artwork 5/50
Generated explanation for artwork 6/50
Generated explanation for artwork 7/50
Generated explanation for artwork 8/50
Generated explanation for artwork 9/50
Generated explanation for artwork 10/50
Generated explanation for artwork 11/50
Generated explanation for artwork 12/50
Generated explanation for artwork 13/50
Generated explanation for artwork 14/50
Generated explanation for artwork 15/50

Waiting for 1 minute after 15 calls...
Resuming generation...
Generated explanation for artwork 16/50
Generated explanation for artwork 17/50
Generated explanation for artwork 18/50
Generated explanation for artwork 19/50
Generated explanation for artwork 20/50
Generated explanation for artwork 21/50
Generated explanation for artwork 22/50
Generated explanation for artwork 23/50
Generated explanation for 
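
The same pause-every-N-calls pattern is reused for generation and for evaluation, so it can be factored into one helper. The sketch below is one possible shape, not the notebook's actual code. The name `run_with_rate_limit` is hypothetical, and failures are recorded as None (instead of an "Error: ..." string) so that they cannot be mistaken for real responses later on.

```python
import time

def run_with_rate_limit(items, call, batch_size=15, pause_seconds=60, sleep=time.sleep):
    """Apply `call` to each item, pausing after every `batch_size` calls."""
    results = []
    for i, item in enumerate(items):
        if i > 0 and i % batch_size == 0:
            sleep(pause_seconds)  # stay under the per-minute request limit
        try:
            results.append(call(item))
        except Exception:
            results.append(None)  # mark the failure and keep going
    return results

# Dry run with a fake sleep function that only records the requested pauses.
pauses = []
out = run_with_rate_limit(
    list(range(31)),
    lambda x: x * 2 if x != 3 else 1 / 0,  # item 3 fails on purpose
    sleep=pauses.append,
)
```

With the real client, `call` could wrap the `generate_content` request, and the default `sleep` would actually wait.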

In [31]:
# Combine prompts and responses into a structured format

# Create a list of dictionaries containing both prompts and responses
combined_data = []
for i, (prompt, response) in enumerate(zip(generated_prompts, responses)):
    combined_data.append({
        'artwork_index': i + 1,
        'prompt': prompt,
        'response': response
    })

# Convert to DataFrame
df_output = pd.DataFrame(combined_data)

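# Save as CSV (create the output folder first if it does not exist)
csv_filename = 'results/baseline/prompts_and_generation.csv'
os.makedirs(os.path.dirname(csv_filename), exist_ok=True)
df_output.to_csv(csv_filename, index=False)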

# Part 3: Evaluation with Structured Output

For this part, we:
1. Create an evaluation prompt and use an LLM to evaluate the quality of each generated explanation.
2. Convert the evaluation into a structured rating on a 1-to-5 scale.

In [32]:
# Define the evaluation prompt
EVALUATION_PROMPT = """\
# Instruction
You are an expert evaluator, knowledgeable in art history and analysis. Your task is to evaluate the quality of the responses generated by AI models that provide explanations for artworks.
We will provide you with the user input (which should ideally contain information about the artwork, like its title, artist, or a description) and the AI-generated response.
You should first read the user input carefully to understand the task and identify the subject artwork from the information provided in the prompt. Then, evaluate the quality of the AI response based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing Artwork Explanation Quality. This measures the AI's ability to provide an accurate, insightful, relevant, and clear explanation of a piece of art, as identified or described in the user prompt. The evaluation should consider how well the explanation addresses the user's specific prompt and relates directly to the artwork in question.

## Criteria
1. Instruction Following: The response directly addresses the user's prompt, adhering to any specified constraints (e.g., focus on symbolism, explain to a child, compare two works).
2. Accuracy: The information presented (artist, date, period, style, technique, common interpretations) is factually correct based on established art historical knowledge for the artwork identified in the prompt.
3. Artwork Focus/Relevance: The explanation is clearly and specifically about the artwork identified in the prompt. It avoids overly generic statements and connects observations directly to the visual elements or known context of the piece.
4. Analytical Depth: The response goes beyond superficial description. It offers insights into technique, composition, symbolism, historical context, or artistic intent, demonstrating analytical thinking relevant to the identified artwork.
5. Clarity and Cohesion: The explanation is well-organized, uses clear and understandable language, and flows logically.
6. Contextualization: The response appropriately places the artwork within its relevant context (e.g., artist's life, art movement, historical period, cultural background) when necessary for understanding, based on the artwork identified in the prompt.

## Rating Rubric
5 (Very Good): Excels in all criteria. Accurate, insightful, highly relevant to the artwork in the prompt, clearly written, follows instructions perfectly, and provides strong context. Demonstrates a nuanced understanding.
4 (Good): Strong performance. Accurate, relevant, and clear. Follows instructions well. May lack some analytical depth or contextual nuance compared to a top score, but is a solid explanation for the artwork in the prompt.
3 (Ok): Acceptable explanation. Generally accurate and follows instructions but may be superficial, lack sufficient depth or context, have minor clarity issues, or include slightly irrelevant information. Gets the basics right for the artwork in the prompt but isn't particularly insightful.
2 (Bad): Significant issues. Contains notable inaccuracies OR fails to follow key instructions OR is poorly focused/largely irrelevant to the artwork in the prompt OR is unclear/difficult to understand OR lacks necessary context/depth. (Note: If the prompt was too vague to identify an artwork, evaluate based on how the AI handled the ambiguity).
1 (Very Bad): Fundamentally flawed. Contains major factual errors, completely ignores the prompt, is irrelevant to any reasonable interpretation of the artwork suggested by the prompt, is incoherent, or provides harmful/misleading information.

## Evaluation Steps
* STEP 1: Carefully read the user prompt to understand the request and identify the subject artwork based *only* on the prompt's content. Note any ambiguity in the prompt itself.
* STEP 2: Read the AI-generated response.
* STEP 3: Assess the response against each criterion: Instruction Following, Accuracy, Artwork Focus/Relevance, Analytical Depth, Clarity and Cohesion, and Contextualization, judging relevance and accuracy based on the artwork identified in Step 1.
* STEP 4: Determine the overall quality and assign a rating from 1 to 5 based on the rubric, justifying your score with reference to the criteria and specific examples from the response. Acknowledge if prompt limitations impacted the possible response quality.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [33]:
# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'

# define a function to evaluate the generated explanation
def eval_generation(prompt, ai_response):
  """Evaluate the generated explanation against the prompt used."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=EVALUATION_PROMPT.format(prompt=prompt, response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="application/json",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


In [34]:
# load generated prompts and responses
prompts_and_generation = pd.read_csv('results/baseline/prompts_and_generation.csv')

# evaluate all prompts and responses
evaluation_results = []
for i, (prompt, response) in enumerate(zip(prompts_and_generation['prompt'], prompts_and_generation['response'])):
    try:
        # Add 1 minute wait after every 8 calls
        if i > 0 and i % 8 == 0:
            print(f"\nWaiting for 1 minute after {i} calls...")
            time.sleep(60)  # Wait for 60 seconds
            print("Resuming evaluation...")
            
        text_eval, struct_eval = eval_generation(prompt=prompt, ai_response=response)
        evaluation_results.append({
            'prompt': prompt,
            'response': response,
            'rating': struct_eval,
            'evaluation_text': text_eval
        })
        print(f"Evaluated artwork {i + 1}/{len(prompts_and_generation)}")
        
    except Exception as e:
        print(f"Error evaluating artwork {i + 1}: {e}")
        evaluation_results.append({
            'prompt': prompt,
            'response': response,
            'rating': None,
            'evaluation_text': f"Error: {e}"
        })

# Convert results to DataFrame
evaluation_results_df = pd.DataFrame(evaluation_results)

# Save results to CSV
csv_filename = 'results/baseline/evaluation_results.csv'
evaluation_results_df.to_csv(csv_filename, index=False)
print(f"\nResults saved to: {csv_filename}")

Evaluated artwork 1/50
Evaluated artwork 2/50
Evaluated artwork 3/50
Evaluated artwork 4/50
Evaluated artwork 5/50
Evaluated artwork 6/50
Evaluated artwork 7/50
Evaluated artwork 8/50

Waiting for 1 minute after 8 calls...
Resuming evaluation...
Evaluated artwork 9/50
Evaluated artwork 10/50
Evaluated artwork 11/50
Evaluated artwork 12/50
Evaluated artwork 13/50
Evaluated artwork 14/50
Evaluated artwork 15/50
Evaluated artwork 16/50

Waiting for 1 minute after 16 calls...
Resuming evaluation...
Evaluated artwork 17/50
Evaluated artwork 18/50
Evaluated artwork 19/50
Evaluated artwork 20/50
Evaluated artwork 21/50
Evaluated artwork 22/50
Evaluated artwork 23/50
Evaluated artwork 24/50

Waiting for 1 minute after 24 calls...
Resuming evaluation...
Evaluated artwork 25/50
Evaluated artwork 26/50
Evaluated artwork 27/50
Evaluated artwork 28/50
Evaluated artwork 29/50
Evaluated artwork 30/50
Evaluated artwork 31/50
Evaluated artwork 32/50

Waiting for 1 minute after 32 calls...
Resuming eval

In [35]:
# Display summary statistics of ratings
print("\nRating distribution:")
print(evaluation_results_df['rating'].value_counts())


Rating distribution:
rating
SummaryRating.GOOD         31
SummaryRating.VERY_GOOD    19
Name: count, dtype: int64
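
Because the ratings are enum members, they are written to the CSV as strings such as "SummaryRating.GOOD". To compute numeric statistics after reloading the file, they need to be mapped back to scores. The sketch below is one way to do this. The enum mirrors the one defined above, and the helper name `rating_to_score` is hypothetical.

```python
import enum

class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'

def rating_to_score(rating):
    """Map a SummaryRating member, or its CSV string form, to an int from 1 to 5."""
    if isinstance(rating, SummaryRating):
        return int(rating.value)
    if isinstance(rating, str):
        name = rating.split(".")[-1]  # "SummaryRating.GOOD" -> "GOOD"
        if name in SummaryRating.__members__:
            return int(SummaryRating[name].value)
    return None  # failed evaluations or unknown labels

# Distribution reported above: 31 GOOD and 19 VERY_GOOD ratings.
labels = ["SummaryRating.GOOD"] * 31 + ["SummaryRating.VERY_GOOD"] * 19
scores = [rating_to_score(label) for label in labels]
mean_score = sum(scores) / len(scores)
```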


Analysis of evaluation results:
- 31 GOOD ratings (62%) - Not bad, we're batting above average! 
- 19 VERY_GOOD ratings (38%) - Pretty impressive, but we're not settling
- 0 BAD ratings - At least we didn't crash and burn! 🎉

Time for some optimization magic!


# Part 4: Some Other Experiments

## Mastering the Art of Concise Generation via Length Control

A common pain point when using audio guides in museums is that each artwork's explanation is pre-recorded and typically runs for several minutes. However, visitors' interest levels vary significantly across different artworks. When encountering pieces that don't immediately capture their attention, visitors often lack the patience to listen through lengthy descriptions.

In such scenarios, a concise initial introduction would be more appropriate. If visitors find the artwork intriguing, they can then ask follow-up questions, allowing our model to generate more detailed explanations on demand.

In this experiment, we begin by calculating the length statistics of our baseline-generated explanations, followed by implementing output length control to produce more concise explanations (limited to 150 words) to serve as initial artwork introductions.

In [36]:
# load results/baseline/evaluation_results.csv
evaluation_results_df = pd.read_csv('results/baseline/evaluation_results.csv')
# display in full width
pd.set_option('display.max_colwidth', None)
evaluation_results_df['response'][0]

'Welcome! Let\'s take a look at this powerful print by Francisco de Goya, specifically Plate 11 from his series "The Disasters of War," titled "Neither do these" or "Ni por esas" in Spanish.\n\nGoya was a Spanish artist, born in 1746 and passed away in 1828, with a long and impactful career spanning various political upheavals. This print, created around 1810, is part of a series reflecting the brutal realities of the Peninsular War, which saw Spain invaded and occupied by Napoleonic forces from around 1808 to 1814. The series as a whole is a searing indictment of war\'s inhumanity.\n\nLooking at the tags - Soldiers, Infants, Men, Women - and the title, you can see the devastating impact of conflict on civilian populations. “Neither do these” implies that not even the most vulnerable—infants—are spared the horrors of war. Goya doesn\'t shy away from portraying the raw, unvarnished truth.\n\nTechnically, this print is a "working proof," meaning it was likely pulled while Goya was still 

In [37]:
# calculate the average number of words per generated explanation
evaluation_results_df['response_length'] = evaluation_results_df['response'].apply(lambda x: len(x.split()))
evaluation_results_df['response_length'].mean()

np.float64(223.7)

In [38]:
# calculate the average number of words per generated explanation for each rating
evaluation_results_df.groupby('rating')['response_length'].mean()

rating
SummaryRating.GOOD         224.387097
SummaryRating.VERY_GOOD    222.578947
Name: response_length, dtype: float64

**Observations:**
- The explanations are fairly long overall (about 224 words on average, which is roughly a minute and a half when read aloud at a typical pace of about 150 words per minute)
- Interestingly, there's very little difference in length between GOOD and VERY_GOOD rated explanations (difference of only about 2 words)
- This suggests that the quality of explanations isn't strongly correlated with their length - shorter explanations aren't necessarily worse, and longer ones aren't necessarily better
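As a quick consistency check on these numbers, the sketch below recombines the per-rating means from the groupby output with the baseline rating counts reported for this run (31 GOOD, 19 VERY_GOOD). The values are copied by hand from the outputs in this section; nothing is recomputed from the CSV.

```python
# Per-rating mean lengths, copied from the groupby output above.
mean_len = {"GOOD": 224.387097, "VERY_GOOD": 222.578947}
# Baseline rating counts (31 GOOD and 19 VERY_GOOD out of 50 responses).
counts = {"GOOD": 31, "VERY_GOOD": 19}

total = sum(counts.values())
# Weighted average of the group means should reproduce the overall mean.
weighted_mean = sum(mean_len[r] * counts[r] for r in counts) / total
# Gap between the two group means, in words.
gap = mean_len["GOOD"] - mean_len["VERY_GOOD"]

print(f"weighted mean: {weighted_mean:.1f} words over {total} responses")
print(f"gap between ratings: {gap:.2f} words")
```

The weighted mean matches the overall average of 223.7 words, and the gap between the two groups is under 2 words, which is less than 1% of the average length.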

This finding supports our experiment's goal of producing shorter explanations (about 150 words, or roughly 200 tokens), since length does not appear to be a determining factor in explanation quality.
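The 150-word to 200-token conversion can be sketched as below. The ratio of 0.75 words per token is an assumed rule of thumb for English text, not a value measured with the Gemini tokenizer, and `words_to_token_budget` is a hypothetical helper name.

```python
import math

# Assumed rule of thumb: about 0.75 English words per token.
# Real ratios depend on the tokenizer and on the text itself.
WORDS_PER_TOKEN = 0.75

def words_to_token_budget(target_words, words_per_token=WORDS_PER_TOKEN):
    """Convert a target word count into an approximate token budget."""
    return math.ceil(target_words / words_per_token)

budget = words_to_token_budget(150)
print(budget)
```

Under this assumption a 150-word target maps to the 200-token value passed to `max_output_tokens` in the next cell. Because the cap is a hard cut-off rather than a target, the resulting word count can land below 150.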

In [39]:
# extract prompts from evaluation_results_df
generated_prompts = evaluation_results_df['prompt']

In [40]:
# Generate the explanations with length control, wait for 1 minute after every 15 calls
client = genai.Client(api_key=GEMINI_API_KEY)
short_config = types.GenerateContentConfig(max_output_tokens=200)

responses = []
for i, prompt in enumerate(generated_prompts):
    try:
        # Add 1 minute wait after every 15 calls
        if i > 0 and i % 15 == 0:
            print(f"\nWaiting for 1 minute after {i} calls...")
            time.sleep(60)  # Wait for 60 seconds
            print("Resuming generation...")
            
        response = client.models.generate_content(
            model="gemini-2.0-flash", config=short_config, contents=prompt
        )
        responses.append(response.text)
        print(f"Generated explanation for artwork {i + 1}/{len(generated_prompts)}")
        
    except Exception as e:
        print(f"Error generating explanation for artwork {i + 1}: {e}")
        responses.append(f"Error: {e}")

# concat prompts and responses and save to a csv file
df_output = pd.concat([generated_prompts, pd.DataFrame({'response': responses})], axis=1)
df_output.to_csv('results/output_length_control/prompts_and_generation_200.csv', index=False)

Generated explanation for artwork 1/50
Generated explanation for artwork 2/50
Generated explanation for artwork 3/50
Generated explanation for artwork 4/50
Generated explanation for artwork 5/50
Generated explanation for artwork 6/50
Generated explanation for artwork 7/50
Generated explanation for artwork 8/50
Generated explanation for artwork 9/50
Generated explanation for artwork 10/50
Generated explanation for artwork 11/50
Generated explanation for artwork 12/50
Generated explanation for artwork 13/50
Generated explanation for artwork 14/50
Generated explanation for artwork 15/50

Waiting for 1 minute after 15 calls...
Resuming generation...
Generated explanation for artwork 16/50
Generated explanation for artwork 17/50
Generated explanation for artwork 18/50
Generated explanation for artwork 19/50
Generated explanation for artwork 20/50
Generated explanation for artwork 21/50
Generated explanation for artwork 22/50
Generated explanation for artwork 23/50
Generated explanation for 

In [53]:
# calculate the average number of words of the length-controlled generations
evaluation_results_df['response_length'] = evaluation_results_df['response'].apply(lambda x: len(x.split()))
evaluation_results_df['response_length'].mean()


np.float64(137.26)

In [41]:
# evaluate the generated explanations with length control
evaluation_results = []
for i, (prompt, response) in enumerate(zip(df_output['prompt'], df_output['response'])):
    try:
        # Add 1 minute wait after every 8 calls
        if i > 0 and i % 8 == 0:
            print(f"\nWaiting for 1 minute after {i} calls...")
            time.sleep(60)  # Wait for 60 seconds
            print("Resuming evaluation...")
            
        text_eval, struct_eval = eval_generation(prompt=prompt, ai_response=response)
        evaluation_results.append({
            'prompt': prompt,
            'response': response,
            'rating': struct_eval,
            'evaluation_text': text_eval
        })
        print(f"Evaluated artwork {i + 1}/{len(df_output)}")
        
    except Exception as e:
        print(f"Error evaluating artwork {i + 1}: {e}")
        evaluation_results.append({
            'prompt': prompt,
            'response': response,
            'rating': None,
            'evaluation_text': f"Error: {e}"
        })

# Convert results to DataFrame
evaluation_results_df = pd.DataFrame(evaluation_results)

# Save results to CSV
csv_filename = 'results/output_length_control/evaluation_results_200.csv'
evaluation_results_df.to_csv(csv_filename, index=False)
print(f"\nResults saved to: {csv_filename}")

Evaluated artwork 1/50
Evaluated artwork 2/50
Evaluated artwork 3/50
Evaluated artwork 4/50
Evaluated artwork 5/50
Error evaluating artwork 6: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.', 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_requests', 'quotaId': 'GenerateRequestsPerMinutePerProjectPerModel-FreeTier', 'quotaDimensions': {'location': 'global', 'model': 'gemini-2.0-flash'}, 'quotaValue': '15'}]}, {'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Learn more about Gemini API quotas', 'url': 'https://ai.google.dev/gemini-api/docs/rate-limits'}]}, {'@type': 'type.googleapis.com/google.rpc.RetryInfo', 'retryDelay': '52s'}]}}
E

**Perform Analysis on Evaluation Results**


In [42]:
# read the evaluation results
evaluation_results_df = pd.read_csv('results/output_length_control/evaluation_results_200.csv')

In [44]:
# check for rows where rating is None; this may be due to the API rate limit
evaluation_results_df['rating'].isnull().sum()

np.int64(3)
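The missing ratings likely correspond to evaluation calls that hit the 429 RESOURCE_EXHAUSTED error shown earlier, so a fixed pause every few calls was not always enough. One possible mitigation is to retry a failed call with exponential backoff. The sketch below is an illustration only: `call_with_retry` is a hypothetical helper, the delays are arbitrary, and it catches every exception for brevity, whereas real code would retry only on the rate-limit error.

```python
import time

def call_with_retry(fn, max_attempts=4, base_delay=15.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in function that fails twice and then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 RESOURCE_EXHAUSTED (simulated)")
    return "ok"

waits = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

In the evaluation loop, the call could then be wrapped as `call_with_retry(lambda: eval_generation(prompt=prompt, ai_response=response))`, so that a rate-limited artwork is retried instead of being stored with `rating=None`.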

In [45]:
# check the evaluation results rating distribution
evaluation_results_df['rating'].value_counts()


rating
SummaryRating.GOOD         32
SummaryRating.OK           11
SummaryRating.BAD           3
SummaryRating.VERY_GOOD     1
Name: count, dtype: int64

**Observations:**

Comparing our length-controlled experiment (150 words) against the baseline reveals interesting trade-offs in our art gallery explanation system:

The baseline generations (~223 words) achieved 31 GOOD and 19 VERY_GOOD ratings, with no lower ratings. The length-controlled experiment shifted dramatically: among the 47 responses that received a rating (3 evaluations failed and have no rating), GOOD ratings stayed at a similar level (32), but VERY_GOOD ratings dropped to just 1, and 11 OK and 3 BAD ratings appeared.

This suggests our 150-word limit might be too aggressive - while it creates more concise introductions, it compromises the depth and quality of explanations. Some artworks likely require more space to convey their significance effectively.
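Because the two runs have different numbers of rated responses (50 versus 47), shares are easier to compare than raw counts. A minimal sketch using only the counts reported above:

```python
from collections import Counter

# Rating counts reported in this section.
baseline = Counter({"GOOD": 31, "VERY_GOOD": 19})
length_controlled = Counter({"GOOD": 32, "OK": 11, "BAD": 3, "VERY_GOOD": 1})

def share(counts, ratings):
    """Fraction of rated responses whose rating is in `ratings`."""
    total = sum(counts.values())
    return sum(counts[r] for r in ratings) / total

for name, counts in [("baseline", baseline), ("200-token cap", length_controlled)]:
    good_or_better = share(counts, ["GOOD", "VERY_GOOD"])
    very_good = share(counts, ["VERY_GOOD"])
    print(f"{name}: n={sum(counts.values())}, "
          f"GOOD or better={good_or_better:.1%}, VERY_GOOD={very_good:.1%}")
```

The unrated rows are excluded from the denominators here; if they were re-evaluated, the shares for the length-controlled run would shift slightly.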

Failure case analysis: Dive into the 'BAD' generations and see what went wrong.

In [48]:
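# select the rows rated BAD; .copy() makes an independent DataFrame so that
# a column can be added to it later without a SettingWithCopyWarning
bad_generations = evaluation_results_df[evaluation_results_df['rating'] == 'SummaryRating.BAD'][['response', 'evaluation_text']].copy()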
print(bad_generations.shape)
bad_generations.head()

(3, 2)


Unnamed: 0,response,evaluation_text
25,"""Welcome! Let's take a closer look at this fascinating print.\n\nThis is an etching titled ""Entry of the Prince of Saxony with his Wife into Dresden on September 2, 1719, after their Marriage in Vienna."" It was created sometime between 1700 and 1755. We can see from the title and the tags – horses, soldiers, men, and carriages – that it depicts a grand procession. Specifically, it shows a ceremonial entrance into Dresden, celebrating a royal marriage.\n\nThe artwork is a collaboration. The print was likely drawn by Adolf van der Laan (1684-1755), a Dutch artist, and Pieter Schenck II, who was born in 1693, served as the publisher. Van der Laan was from Amsterdam, which was a bustling hub of artistic production at the time.\n\nThe period it represents, the early 18th century, was a time of","## Evaluation\n\n### STEP 1: Understand the Prompt and Identify the Artwork\nThe prompt provides detailed information about a print titled ""Entry of the Prince of Saxony with his Wife into Dresden on September 2, 1719, after their Marriage in Vienna,"" created around 1700-1755 by Adolf van der Laan and published by Pieter Schenck II. The prompt requests a concise explanation focusing on the artist's background, subject matter, historical significance, materials/techniques, size, and location.\n\n### STEP 2: Read the AI-Generated Response\nThe AI response provides a starting explanation of the artwork, including its title, subject matter, and the artists involved. It also touches on the historical context and the artistic background of one of the artists. 
However, the response is incomplete and abruptly ends mid-sentence.\n\n### STEP 3: Assess the Response Against Each Criterion\n\n* **Instruction Following:** The AI attempts to follow the instructions, addressing the artist's background, subject matter, and historical period, but it doesn't get through the materials/techniques, size, and location, likely due to the incomplete generation.\n* **Accuracy:** The information provided is accurate based on the data in the prompt.\n* **Artwork Focus/Relevance:** The explanation is directly relevant to the specified artwork.\n* **Analytical Depth:** The response starts to provide some historical context but doesn't delve deeply into the artwork's meaning or artistic techniques.\n* **Clarity and Cohesion:** The initial part of the explanation is clear and cohesive, but the abrupt ending disrupts the flow.\n* **Contextualization:** The response begins to contextualize the artwork by mentioning Amsterdam's artistic importance but doesn't fully develop this aspect.\n\n### STEP 4: Determine the Overall Quality and Assign a Rating\n\nDue to the response being incomplete and abruptly cut off, the overall quality is significantly diminished. While the initial part of the response is good and addresses several aspects of the prompt, the lack of a complete explanation and the omission of key details mentioned in the prompt (dimensions and repository) leads to a low rating.\n\n**Rating: 2 (Bad)**\n\n**Explanation:**\n\nThe response contains a significant issue as it is incomplete. It fails to fully address the instructions provided in the prompt and lacks essential information, such as the artwork's dimensions and repository.\n"
30,"Good morning! Welcome to the Drawings and Prints section. Today, I'd like to draw your attention to this fascinating piece entitled ""Design for a Sofa Back Cover (?) with an Ornamental Frame Containing a Large Horizontal Garland of Flowers and Leaves with Musical Instruments and a Vase, Decorated with Acanthus Leaves and Two Fleurs de Lys.""\n\nUnfortunately, the artist is currently listed as ""Anonymous, French, 19th century,"" so we don't have a specific name to associate with this work. We know they were working in France sometime between 1800 and 1900, designing objects such as this potential sofa back cover.\n\nThe object itself is dated between 1870 and 1900, giving us a window into the late 19th century, a period known for its opulent interiors and decorative arts. The title gives us a good idea of the subject matter. It’s a design, likely for a textile, meant to adorn","## Evaluation\n\n### STEP 1: Understand the Prompt\nThe prompt requests an explanation of an artwork based on provided metadata. The artwork is a design drawing for a sofa back cover, created by an anonymous French artist in the late 19th century. The response should cover the artist's background, subject matter, historical significance, materials/techniques, size, and location.\n\n### STEP 2: Read the AI-Generated Response\nThe AI provides an initial welcome and identifies the artwork. It notes the anonymous artist and the potential purpose of the design. The response is cut off at the end of the second paragraph.\n\n### STEP 3: Assess the Response\n* **Instruction Following:** The response starts well, addressing the provided details. It successfully mentions the artist's anonymity, nationality, and date range, as well as the title and potential subject matter. However, because the response is cut off, it doesn't fully address all the instructions. 
It does not include the material and techniques used, its size or its location.\n* **Accuracy:** The information provided is accurate based on the prompt.\n* **Artwork Focus/Relevance:** The explanation focuses on the identified artwork.\n* **Analytical Depth:** The response starts to offer some context regarding the late 19th century and its decorative arts, but it's cut short.\n* **Clarity and Cohesion:** The explanation is clear and well-organized so far.\n* **Contextualization:** The response begins to contextualize the artwork within its historical period.\n\n### STEP 4: Determine the Overall Quality and Assign a Rating\nDue to the incomplete nature of the response, the rating is capped. While it correctly identifies the artwork and begins to address the instructions, it fails to deliver a complete explanation as requested.\n\n**Rating: 2 (Bad)**\n\n**Explanation:** The response is incomplete and fails to fully address the instructions due to being cut off. While the information presented is accurate and relevant, the lack of a complete explanation warrants a ""Bad"" rating.\n"
48,"Welcome! Let me draw your attention to this fascinating print titled ""Speculum Romanae Magnificentiae: Circus Flaminius.""\n\nThis engraving, created in 1553, gives us a glimpse into the grandeur of ancient Rome. It depicts the Circus Flaminius, a significant architectural landmark of the time, dedicated to public games and events. You can see that it's tagged with both ""Architecture"", ""Circus"" and ""Rome"", giving a good sense of what to expect from it.\n\nThe print is a collaborative effort. Nicolas Beatrizet, a French artist born in Lunéville in 1515, and active in Rome, is credited as an artist on this piece. His work is joined by Pirro Ligorio, an Italian artist from Naples, who also contributed to the work. Finally, Michele Tramezzino, an Italian publisher active in Venice and Rome, played a vital role in making this print available to a wider audience. This collaboration highlights the","## Evaluation\n\n### STEP 1: Understand the Prompt\nThe prompt requests an explanation of the print ""Speculum Romanae Magnificentiae: Circus Flaminius,"" providing specific details about the artwork, artists, and its historical context. The AI is instructed to act as a tour guide and focus on artist background, subject matter, historical period, materials/techniques, size, and location.\n\n### STEP 2: Read the AI-Generated Response\nThe AI-generated response provides an introduction to the print, identifies the Circus Flaminius as the subject, and mentions the artists involved. It also notes the medium and collaborative nature of the print. However, the response is incomplete as it cuts off mid-sentence.\n\n### STEP 3: Assess the Response Against Each Criterion\n\n* **Instruction Following:** Partially follows instructions. 
It starts addressing the requested points but doesn't complete them due to the truncation.\n* **Accuracy:** The information provided is accurate based on the provided details.\n* **Artwork Focus/Relevance:** The explanation is relevant to the artwork identified in the prompt.\n* **Analytical Depth:** Lacks analytical depth due to the incomplete nature of the response.\n* **Clarity and Cohesion:** The beginning of the response is clear and cohesive, but it abruptly ends.\n* **Contextualization:** Touches upon contextualization by mentioning the grandeur of ancient Rome but doesn't fully develop this.\n\n### STEP 4: Determine the Overall Quality and Assign a Rating\n\nGiven that the response is incomplete, it's difficult to give it a high rating. While the information presented is accurate and relevant, the lack of completion significantly hinders its overall quality. I will rate it a 2 (Bad) due to the significant issue of being incomplete. Had the response not been cut off, it would have likely scored higher.\n"


Use AI to perform analysis on the bad generations and identify common issues.

In [49]:
ANALYSIS_PROMPT = """\
# Instruction
You are an expert analyst tasked with identifying and explaining why an AI-generated artwork explanation received a poor rating. Your task is to analyze both the AI's response and the evaluator's feedback to clearly summarize the key issues that led to the low rating.

Please provide:
1. A concise list of the main problems identified in the response
2. Specific examples of these issues from the response
3. A brief explanation of how these issues impacted the overall quality

# Input
## AI Response:
{response}

## Evaluator's Feedback:
{evaluation_text}

# Expected Output Format
Please provide your analysis in the following format:

SUMMARY OF ISSUES:
- [List key problems identified]

SPECIFIC EXAMPLES:
- [Cite relevant examples from the response]

IMPACT:
[Brief explanation of how these issues affected the response quality]
"""

In [50]:
client = genai.Client(api_key=GEMINI_API_KEY)

analyses = []
for i, (response, evaluation_text) in enumerate(zip(bad_generations['response'], bad_generations['evaluation_text'])):
    try:
        # Add 1 minute wait after every 10 calls
        if i > 0 and i % 10 == 0:
            print(f"\nWaiting for 1 minute after {i} calls...")
            time.sleep(60)  # Wait for 60 seconds
            print("Resuming analysis...")
            
        chat = client.chats.create(model='gemini-2.0-flash')
        analysis = chat.send_message(
            message=ANALYSIS_PROMPT.format(
                response=response,
                evaluation_text=evaluation_text
            )
        )
        analyses.append(analysis.text)
        print(f"Analyzed bad response {i + 1}/{len(bad_generations)}")
        
    except Exception as e:
        print(f"Error analyzing response {i + 1}: {e}")
        analyses.append(f"Error: {e}")

# Add analyses to the DataFrame and save to csv
bad_generations['analysis'] = analyses
bad_generations.to_csv('results/output_length_control/bad_generations_analysis.csv', index=False)

Analyzed bad response 1/3
Analyzed bad response 2/3
Analyzed bad response 3/3


In [52]:
# print out the analysis of the bad generations
bad_generations[['response', 'analysis']].head(3)

Unnamed: 0,response,analysis
25,"""Welcome! Let's take a closer look at this fascinating print.\n\nThis is an etching titled ""Entry of the Prince of Saxony with his Wife into Dresden on September 2, 1719, after their Marriage in Vienna."" It was created sometime between 1700 and 1755. We can see from the title and the tags – horses, soldiers, men, and carriages – that it depicts a grand procession. Specifically, it shows a ceremonial entrance into Dresden, celebrating a royal marriage.\n\nThe artwork is a collaboration. The print was likely drawn by Adolf van der Laan (1684-1755), a Dutch artist, and Pieter Schenck II, who was born in 1693, served as the publisher. Van der Laan was from Amsterdam, which was a bustling hub of artistic production at the time.\n\nThe period it represents, the early 18th century, was a time of","SUMMARY OF ISSUES:\n- Incomplete response; the text abruptly ends mid-sentence.\n- Fails to address all aspects of the prompt, specifically omitting information about materials/techniques, size, and location of the artwork.\n- Lacks analytical depth beyond stating basic facts.\n\nSPECIFIC EXAMPLES:\n- ""The period it represents, the early 18th century, was a time of"" - Abrupt ending.\n- The response mentions the artists' background and the subject matter, but does not discuss the materials and techniques used in creating the etching, the size of the print, or where it is currently located (repository).\n\nIMPACT:\nThe incomplete response severely limits its usefulness and comprehensiveness. The abrupt ending leaves the user without a complete understanding of the artwork. Furthermore, the omission of key information, as specified in the prompt, renders the response unsatisfactory. This significantly lowers the overall quality and leads to a ""Bad"" rating.\n"
30,"Good morning! Welcome to the Drawings and Prints section. Today, I'd like to draw your attention to this fascinating piece entitled ""Design for a Sofa Back Cover (?) with an Ornamental Frame Containing a Large Horizontal Garland of Flowers and Leaves with Musical Instruments and a Vase, Decorated with Acanthus Leaves and Two Fleurs de Lys.""\n\nUnfortunately, the artist is currently listed as ""Anonymous, French, 19th century,"" so we don't have a specific name to associate with this work. We know they were working in France sometime between 1800 and 1900, designing objects such as this potential sofa back cover.\n\nThe object itself is dated between 1870 and 1900, giving us a window into the late 19th century, a period known for its opulent interiors and decorative arts. The title gives us a good idea of the subject matter. It’s a design, likely for a textile, meant to adorn","SUMMARY OF ISSUES:\n- Incomplete response.\n- Fails to address all aspects of the prompt (materials/techniques, size, location).\n- Lacks depth in analysis and contextualization due to abrupt ending.\n\nSPECIFIC EXAMPLES:\n- ""The title gives us a good idea of the subject matter. It’s a design, likely for a textile, meant to adorn"" - This sentence is unfinished, indicating the response was cut off.\n- The response mentions the ""anonymous, French, 19th century"" artist but does not explore potential reasons for anonymity within the art world during that period. It also fails to mention where the piece is currently located, the materials used, or the techniques employed to create it.\n- The reference to ""opulent interiors and decorative arts"" is a good starting point for historical context, but it's not developed further because the response ends prematurely.\n\nIMPACT:\nThe incomplete response significantly limits the overall quality. The failure to address all aspects of the prompt, especially regarding materials, size, and location, results in a superficial analysis. 
The lack of a complete explanation and deeper contextualization prevents the AI from demonstrating a thorough understanding of the artwork and its significance.\n"
48,"Welcome! Let me draw your attention to this fascinating print titled ""Speculum Romanae Magnificentiae: Circus Flaminius.""\n\nThis engraving, created in 1553, gives us a glimpse into the grandeur of ancient Rome. It depicts the Circus Flaminius, a significant architectural landmark of the time, dedicated to public games and events. You can see that it's tagged with both ""Architecture"", ""Circus"" and ""Rome"", giving a good sense of what to expect from it.\n\nThe print is a collaborative effort. Nicolas Beatrizet, a French artist born in Lunéville in 1515, and active in Rome, is credited as an artist on this piece. His work is joined by Pirro Ligorio, an Italian artist from Naples, who also contributed to the work. Finally, Michele Tramezzino, an Italian publisher active in Venice and Rome, played a vital role in making this print available to a wider audience. This collaboration highlights the","SUMMARY OF ISSUES:\n- Incomplete response, cutting off mid-sentence.\n- Lacks analytical depth due to incompleteness.\n- Insufficient contextualization despite acknowledging the historical period.\n\nSPECIFIC EXAMPLES:\n- ""This collaboration highlights the"" - sentence abruptly ends, leaving the explanation unfinished.\n- While it mentions ""grandeur of ancient Rome,"" the response doesn't elaborate on the historical context or significance of the Circus Flaminius within that period.\n\nIMPACT:\nThe incomplete nature of the response is the most significant factor leading to the low rating. It prevents the AI from fully addressing the prompt's requirements regarding artist background, subject matter, historical period, materials/techniques, size, and location. The lack of completion makes it difficult to assess the AI's analytical depth and contextualization abilities. Because the response ends abruptly, the user is left with an unfinished thought.\n"


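The main reason for a 'BAD' rating is an incomplete response. This is expected, because `max_output_tokens=200` cuts generation off at the token limit, often mid-sentence, instead of asking the model to write a shorter explanation.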
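All three BAD responses stop mid-sentence, so a simple text check can flag likely truncation before a response is shown or evaluated. The sketch below is a heuristic under stated assumptions: `looks_truncated` and `trim_to_last_sentence` are hypothetical helpers, and the sentence detection is naive (an abbreviation such as "ca." would count as a sentence end). The API response also carries a finish reason that would be a more direct signal; it is not used here.

```python
import re

CLOSERS = "\"'”’)]*"  # characters that may follow the final punctuation

def looks_truncated(text):
    """Heuristic: True if the text does not end with ., ! or ?."""
    core = text.rstrip().rstrip(CLOSERS)
    return not core.endswith((".", "!", "?"))

def trim_to_last_sentence(text):
    """Drop a trailing unfinished sentence, if a finished one precedes it."""
    ends = list(re.finditer(r"[.!?][\"'”’)\]*]*(?=\s|$)", text))
    return text[:ends[-1].end()] if ends else text

# Endings taken from the three BAD responses above.
endings = [
    "The period it represents, the early 18th century, was a time of",
    "It’s a design, likely for a textile, meant to adorn",
    "This collaboration highlights the",
]
for ending in endings:
    print(looks_truncated(ending), "|", ending)

print(trim_to_last_sentence("It depicts a grand procession. It was a time of"))
```

Trimming hides the cut-off but does not restore the missing content (materials, size, location), so stating the length target in the prompt is probably still needed alongside the token cap.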

## The Eyes Have It: Enhancing Art Interpretation with Visual Intelligence

Both our baseline experiment and length control trials relied solely on textual content for generating artwork explanations. We propose enhancing this approach by leveraging multimodal information, specifically by extracting additional valuable insights from artwork images to supplement the text-based descriptions. This visual integration would make the generated explanations more aligned with visitors' actual viewing experience.

Moreover, this multimodal approach offers another significant advantage: even when textual information about an artwork is incomplete or missing, the system can still generate meaningful descriptions by analyzing the visual elements.

The Gemini API can generate text output in response to various inputs, including text, images, video, and audio. See the Gemini API documentation page for more information.

https://ai.google.dev/gemini-api/docs/text-generation

In [54]:
# load the selected dataframe
df_selected = pd.read_csv('data/sample50_drawings_and_prints_cleaned.csv')

In [55]:
# For the baseline, the ID and URL columns are not included in the prompt
url_cols = ['Primary Image URL','Tags AAT URL','Tags Wikidata URL']
id_cols = ['Object ID','Object Number','Constituent ID']

baseline_cols = list(set(df_selected.columns) - set(url_cols) - set(id_cols))

# Now, we add the 'Primary Image URL' in addition to baseline_cols for the image input
cols_with_image = baseline_cols + ['Primary Image URL']

In [56]:
# Extract selected columns from df_selected
df_selected_image = df_selected[cols_with_image]

# print the shape of the dataframe
print(df_selected_image.shape)

# print the column names: the artwork and artist columns used in the baseline, plus 'Primary Image URL'
print(df_selected_image.columns)

(50, 18)
Index(['Artist End Date', 'Object Begin Date', 'Department', 'Object Date',
       'Artist Display Name', 'Artist Role', 'Dimensions', 'Medium',
       'Object End Date', 'Object Name', 'Title', 'Artist Begin Date',
       'Classification', 'Repository', 'Artist Display Bio',
       'Artist Nationality', 'Tags', 'Primary Image URL'],
      dtype='object')


In [57]:
# Test for gemini api with the first row of the dataframe
image_path = df_selected_image['Primary Image URL'].iloc[0]
print('image_path: ', image_path)
image = requests.get(image_path, timeout=30)
image.raise_for_status()  # stop here if the image download failed

client = genai.Client(api_key=GEMINI_API_KEY)
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=["What is this image?",
              types.Part.from_bytes(data=image.content, mime_type="image/jpeg")])

print(response.text)

image_path:  https://images.metmuseum.org/CRDImages/dp/original/DP817292.jpg
The image appears to be an etching by Francisco Goya, specifically from "The Disasters of War" series. The scene depicts a violent and distressing event, likely related to the Peninsular War. It shows soldiers attacking and killing civilians, including women and children. The composition is dark and dramatic, emphasizing the brutality and horror of war.


<img src="sample_image/sample_image.png" alt="Sample Image" width="300" height="300">

Modify the baseline prompt template to combine text and image information in a structured way where:

1. Text serves as the primary source for facts and details
2. Images play a supporting role to:
   - Verify what's mentioned in the text
   - Add visual descriptions (colors, composition, textures)
   - Enhance the explanation with observable details

In [60]:
prompt_template_with_image = """
You are a helpful tour guide in an art gallery.
Please provide an explanation and insights for the following artwork to the visitor. Use the provided text details as the main source of information, and refer to the accompanying image for visual context and potential enhancement of your description.

**Artwork Details (Text):**
{details}

**Instructions:**
1.  Base your explanation primarily on the **Artwork Details (Text)** provided above.
2.  Use the image as a visual reference to potentially add descriptive nuances or confirm visual aspects mentioned in the text details.
3.  Generate a concise explanation focusing on:
    * The artist's background and context (from text details).
    * The artwork's subject matter (from text details, potentially referencing specific elements visible in the image).
    * The historical period and significance (from text details).
    * The materials and techniques used (from text details, potentially adding visual description like visible brushstrokes or texture if apparent in the image).
    * Mention its size and location (from text details).
4.  Ensure the core facts come from the **Artwork Details (Text)**, but feel free to enrich the description with observations directly inspired by viewing the **Artwork Image URL** (e.g., commenting on composition, color palette, specific visual details).

**Generated Explanation:**
[Generate the explanation here, primarily based on the text details but enhanced by visual reference to the image]
"""

In [61]:
# helper function to generate prompts using the template
def generate_artwork_prompts(df, cols, prompt_template):
    """
    Generate prompts for artwork descriptions based on dataframe columns
    
    Args:
        df (pd.DataFrame): DataFrame containing artwork information
        cols (list): List of column names to include in prompts
        prompt_template (str): Template string for generating prompts with {details} placeholder
        
    Returns:
        list: List of generated prompts
    """
    # List to store the generated prompts 
    generated_prompts = []

    print("\n--- Generating Prompts ---")

    # Iterate through each row of the DataFrame
    for index, row in df.iterrows():
        # Create the details string for the current artwork
        details_str = ""
        for col_name in cols: # Iterate through provided columns
            # Ensure the value is converted to string, handle potential None/NaN values
            value = str(row[col_name]) if pd.notna(row[col_name]) else "N/A"
            details_str += f"* **{col_name}:** {value}\n"

        # Format the full prompt using the template and the generated details string
        full_prompt = prompt_template.format(details=details_str.strip())
        generated_prompts.append(full_prompt)
        
    return generated_prompts

In [62]:
# generate text prompts using baseline cols (without the image URL),
# because the image is passed as a separate input during the API call
generated_prompts = generate_artwork_prompts(df_selected_image, baseline_cols, prompt_template_with_image)


--- Generating Prompts ---


In [64]:
# --- Display Example Prompt ---
if generated_prompts:
    print("\nExample of a generated prompt (for the first artwork):")
    print(generated_prompts[0])
    print(f"\nTotal prompts generated: {len(generated_prompts)}")
else:
    print("\nNo prompts were generated (DataFrame might be empty or columns missing).")


Example of a generated prompt (for the first artwork):

You are a helpful tour guide in an art gallery.
Please provide an explanation and insights for the following artwork to the visitor. Use the provided text details as the main source of information, and refer to the accompanying image for visual context and potential enhancement of your description.

**Artwork Details (Text):**
* **Artist End Date:** 1828      
* **Object Begin Date:** 1805
* **Department:** Drawings and Prints
* **Object Date:** ca. 1810
* **Artist Display Name:** Goya (Francisco de Goya y Lucientes)
* **Artist Role:** Artist
* **Dimensions:** Plate: 6 5/16 × 8 3/8 in. (16.1 × 21.2 cm)
Sheet: 8 3/8 × 12 3/4 in. (21.3 × 32.4 cm)
* **Medium:** Etching, drypoint, burin (working proof)
* **Object End Date:** 1815
* **Object Name:** Print
* **Title:** Plate 11 from "The Disasters of War" (Los Desastres de la Guerra): 'Neither do these' (Ni por esas)
* **Artist Begin Date:** 1746      
* **Classification:** Prints
* **

In [65]:
# helper function to generate explanation with text and image
def generate_explanation_with_image(prompt, image_url):
    """
    Generate an explanation for an artwork using Gemini API
    
    Args:
        prompt (str): Text prompt for the image
        image_url (str): URL of the image to analyze
        
    Returns:
        str: Generated explanation from Gemini API
    """
    try:
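        # Get image data from URL; fail early on HTTP errors so an error page
        # is never sent to the model as if it were image bytes
        image = requests.get(image_url, timeout=30)
        image.raise_for_status()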
        
        # Initialize Gemini client
        client = genai.Client(api_key=GEMINI_API_KEY)
        
        # Generate content using both text and image
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                prompt,
                types.Part.from_bytes(data=image.content, mime_type="image/jpeg")
            ]
        )
        
        return response.text
        
    except Exception as e:
        return f"Error generating explanation: {str(e)}"
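
The function above hard-codes `mime_type="image/jpeg"`. If some image URLs point to PNG or other formats, one option is to derive the MIME type from the response's `Content-Type` header and fall back to the file extension in the URL. A minimal sketch, assuming a hypothetical helper `guess_image_mime_type` that is not part of the original notebook:

```python
import mimetypes
from urllib.parse import urlparse


def guess_image_mime_type(image_url, content_type_header=None, default="image/jpeg"):
    """Hypothetical helper: pick an image MIME type to pass along with the image bytes."""
    # Prefer the server-reported Content-Type, ignoring parameters such as charset
    if content_type_header:
        mime = content_type_header.split(";")[0].strip().lower()
        if mime.startswith("image/"):
            return mime
    # Otherwise guess from the file extension in the URL path
    guessed, _ = mimetypes.guess_type(urlparse(image_url).path)
    if guessed and guessed.startswith("image/"):
        return guessed
    # Fall back to the value the notebook currently assumes
    return default
```

Inside `generate_explanation_with_image`, this could be used as `mime_type=guess_image_mime_type(image_url, image.headers.get("Content-Type"))`.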

**Sample Test:**

Generate explanation with text and image for the first artwork.

Compare it with the baseline generation that only uses text.

In [66]:
# Test: Generate explanation with text and image for the first artwork
test_prompt = generated_prompts[0]
test_image_url = df_selected_image['Primary Image URL'].iloc[0]

# generate explanation
explanation = generate_explanation_with_image(test_prompt, test_image_url)
print(explanation)


Welcome to the gallery. Let's take a closer look at this powerful print.

This is Plate 11 from "The Disasters of War," a series by Francisco de Goya, a Spanish artist born in 1746 and who died in 1828. Goya was a significant figure, witnessing immense social and political upheaval during his lifetime. This particular print, titled "Neither do these" ('Ni por esas'), was created around 1810, placing it squarely within the period of the Peninsular War. The whole series created between 1805 and 1815, is a direct response to the brutality and devastation he witnessed during this conflict.

Looking at the image, you can see that it depicts a scene of violence. There are soldiers present, and the central figures appear to be civilians – men, women, and even infants are represented. The image is a stark depiction of the atrocities inflicted upon the civilian population during the war.

Goya employed etching, drypoint, and burin techniques to create this print. This combination of methods all

In [67]:
# load baseline generation
baseline_generations = pd.read_csv('results/baseline/prompts_and_generation.csv')
# print the generation text of the first artwork
baseline_generations['response'].iloc[0]

'Welcome! Let\'s take a look at this powerful print by Francisco de Goya, specifically Plate 11 from his series "The Disasters of War," titled "Neither do these" or "Ni por esas" in Spanish.\n\nGoya was a Spanish artist, born in 1746 and passed away in 1828, with a long and impactful career spanning various political upheavals. This print, created around 1810, is part of a series reflecting the brutal realities of the Peninsular War, which saw Spain invaded and occupied by Napoleonic forces from around 1808 to 1814. The series as a whole is a searing indictment of war\'s inhumanity.\n\nLooking at the tags - Soldiers, Infants, Men, Women - and the title, you can see the devastating impact of conflict on civilian populations. “Neither do these” implies that not even the most vulnerable—infants—are spared the horrors of war. Goya doesn\'t shy away from portraying the raw, unvarnished truth.\n\nTechnically, this print is a "working proof," meaning it was likely pulled while Goya was still 

**Observations:**
- First Generation (Image + Text): Refers directly to the image (e.g., "Looking at the image, you can see that it depicts a scene of violence"), so the description is more grounded in what is actually visible in the artwork.

- Second Generation (Text Only): Relies on the provided tags and metadata (e.g., "Looking at the tags - Soldiers, Infants, Men, Women") and makes broader statements about the content without specific visual details.

In [68]:
# Generate explanations for all artworks with text and image
# Get image URLs from the dataframe
image_urls = df_selected_image['Primary Image URL'].tolist()

# Generate explanations for all artworks
print("\nGenerating explanations for all artworks...")
responses = []

# Process each artwork
for i, (prompt, image_url) in enumerate(tqdm(zip(generated_prompts, image_urls))):
    try:
        # Add delay after every 15 calls to respect rate limits
        if i > 0 and i % 15 == 0:
            print(f"\nWaiting for 1 minute after {i} calls...")
            time.sleep(60)  # Wait for 60 seconds
            print("Resuming generation...")
        
        # Generate explanation using the existing function
        response = generate_explanation_with_image(prompt, image_url)
        responses.append(response)
        print(f"Generated explanation for artwork {i + 1}/{len(generated_prompts)}")
        
    except Exception as e:
        print(f"Error processing artwork {i + 1}: {e}")
        responses.append(f"Error: {e}")

# Create DataFrame with prompts, image URLs, and responses
results_df = pd.DataFrame({
    'prompt': generated_prompts,
    'image_url': image_urls,
    'response': responses
})



Generating explanations for all artworks...


1it [00:03,  3.80s/it]

Generated explanation for artwork 1/50


2it [00:10,  5.40s/it]

Generated explanation for artwork 2/50


3it [00:21,  7.85s/it]

Generated explanation for artwork 3/50


4it [00:25,  6.50s/it]

Generated explanation for artwork 4/50


5it [00:34,  7.25s/it]

Generated explanation for artwork 5/50


6it [00:38,  6.29s/it]

Generated explanation for artwork 6/50


7it [00:48,  7.37s/it]

Generated explanation for artwork 7/50


8it [00:53,  6.85s/it]

Generated explanation for artwork 8/50


9it [01:01,  7.08s/it]

Generated explanation for artwork 9/50


10it [01:06,  6.47s/it]

Generated explanation for artwork 10/50


11it [01:14,  6.83s/it]

Generated explanation for artwork 11/50


12it [01:21,  6.95s/it]

Generated explanation for artwork 12/50


13it [01:28,  6.98s/it]

Generated explanation for artwork 13/50


14it [01:35,  7.01s/it]

Generated explanation for artwork 14/50


15it [01:42,  6.99s/it]

Generated explanation for artwork 15/50

Waiting for 1 minute after 15 calls...
Resuming generation...


16it [02:46, 24.11s/it]

Generated explanation for artwork 16/50


17it [02:51, 18.35s/it]

Generated explanation for artwork 17/50


18it [02:58, 15.01s/it]

Generated explanation for artwork 18/50


19it [03:03, 11.86s/it]

Generated explanation for artwork 19/50


20it [03:08,  9.90s/it]

Generated explanation for artwork 20/50


21it [03:14,  8.87s/it]

Generated explanation for artwork 21/50


22it [03:18,  7.38s/it]

Generated explanation for artwork 22/50


23it [03:24,  6.98s/it]

Generated explanation for artwork 23/50


24it [03:34,  7.80s/it]

Generated explanation for artwork 24/50


25it [03:43,  8.19s/it]

Generated explanation for artwork 25/50


26it [03:51,  8.10s/it]

Generated explanation for artwork 26/50


27it [03:56,  7.11s/it]

Generated explanation for artwork 27/50


28it [03:59,  5.80s/it]

Generated explanation for artwork 28/50


29it [04:04,  5.70s/it]

Generated explanation for artwork 29/50


30it [04:07,  4.97s/it]

Generated explanation for artwork 30/50

Waiting for 1 minute after 30 calls...
Resuming generation...


31it [05:13, 23.27s/it]

Generated explanation for artwork 31/50


32it [05:20, 18.17s/it]

Generated explanation for artwork 32/50


33it [05:25, 14.49s/it]

Generated explanation for artwork 33/50


34it [05:29, 11.20s/it]

Generated explanation for artwork 34/50


35it [05:33,  8.94s/it]

Generated explanation for artwork 35/50


36it [05:39,  8.29s/it]

Generated explanation for artwork 36/50


37it [05:46,  7.66s/it]

Generated explanation for artwork 37/50


38it [06:00,  9.60s/it]

Generated explanation for artwork 38/50


39it [06:05,  8.23s/it]

Generated explanation for artwork 39/50


40it [06:08,  6.66s/it]

Generated explanation for artwork 40/50


41it [06:12,  6.01s/it]

Generated explanation for artwork 41/50


42it [06:18,  5.93s/it]

Generated explanation for artwork 42/50


43it [06:23,  5.72s/it]

Generated explanation for artwork 43/50


44it [06:28,  5.58s/it]

Generated explanation for artwork 44/50


45it [06:36,  6.04s/it]

Generated explanation for artwork 45/50

Waiting for 1 minute after 45 calls...
Resuming generation...


46it [07:44, 24.65s/it]

Generated explanation for artwork 46/50


47it [07:52, 19.78s/it]

Generated explanation for artwork 47/50


48it [08:00, 16.28s/it]

Generated explanation for artwork 48/50


49it [08:09, 14.17s/it]

Generated explanation for artwork 49/50


50it [08:12,  9.86s/it]

Generated explanation for artwork 50/50
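
The loop above pauses for a fixed 60 seconds after every 15 calls, regardless of how long the calls themselves took. An alternative is to track call timestamps in a sliding window and sleep only as long as needed. A minimal sketch, assuming a hypothetical `RateLimiter` class; the default of 15 calls per minute mirrors the loop above and is not a documented quota:

```python
import time
from collections import deque


class RateLimiter:
    """Hypothetical helper: allow at most max_calls per period (in seconds)."""

    def __init__(self, max_calls=15, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep    # injectable so the class can be tested without waiting
        self.calls = deque()  # timestamps of calls inside the current window

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Drop timestamps that have left the sliding window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

With this helper, the generation loop would call `limiter.wait()` once before each `generate_explanation_with_image(prompt, image_url)` instead of checking `i % 15`.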





In [70]:
# Save to CSV
output_file = 'results/image_input/prompts_and_generation.csv'
results_df.to_csv(output_file, index=False)
print(f"\nResults saved to: {output_file}")

# Display the first result
print("\nFirst few generations:")
print(results_df.head(1)['response'])


Results saved to: results/image_input/prompts_and_generation.csv

First few generations:
0    Welcome! Let's take a look at this powerful print.\n\nThis is "Plate 11 from 'The Disasters of War': 'Neither do these'" by Francisco de Goya. Goya was a Spanish artist born in 1746 and died in 1828. He was a prominent figure in Spanish art, witnessing and reacting to significant historical events of his time.\n\nThis particular print, created around 1810 and 1815, is part of a series depicting the brutal realities of the Peninsular War (1808-1814). The series is a stark commentary on the atrocities committed during the conflict.\n\nLooking at the image, we can see the artist's focus on the impact of war on civilians. The soldiers, men, women, and even infants are the tags for this picture. The title "Neither do these" along with the image indicates a scene of cruelty and suffering inflicted upon innocents.\n\nTechnically, this print is a working proof, realized using etching, drypoint, and b

**Evaluation**

In [71]:
# load the generation with image input
image_generations = pd.read_csv('results/image_input/prompts_and_generation.csv')

The **Text+Image evaluation framework** builds on the baseline text-only approach by adding visual assessment while keeping the core 1-5 rating scale.

In addition to the baseline's textual analysis criteria (Instruction Following, Accuracy, Artwork Focus, Analytical Depth, Clarity, and Contextualization), it introduces visual-specific evaluation components including image usage relevance and visual accuracy verification. 

In [72]:
# Define the evaluation prompt
ARTWORK_EXPLANATION_PROMPT_TEXT_PRIMARY = """\
# Instruction
You are an expert evaluator, knowledgeable in art history and visual analysis. Your task is to evaluate the quality of the responses generated by AI models that provide explanations for artworks, using a **primary text prompt** and a **supplementary image input**.
We will provide you with the user's inputs (the primary text prompt and the supplementary image) and the AI-generated response.
You should first **read the primary text prompt carefully** to understand the core task and requirements. Then, examine the supplementary image to understand the visual context provided. Your evaluation should focus on how well the AI addresses the **primary text prompt**, assessing if and how effectively it utilizes the **supplementary image for relevant illustration, evidence, or visual grounding**.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Provide a step-by-step explanation for your rating, referencing specific parts of the response, the text prompt, and visual details **if the image was appropriately used**. Only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing **Artwork Explanation Quality (Text-Primary + Image-Auxiliary)**. This measures the AI's ability to provide an accurate, insightful, relevant, and clear explanation that **primarily addresses the requirements of the text prompt**. It also assesses the AI's ability to **appropriately and accurately integrate relevant details from the supplementary image** to enhance or support the explanation, without being unduly distracted by irrelevant visual information.

## Criteria
* **Instruction Following (Text Prompt)**: The response demonstrates a clear understanding of the **primary text prompt** and directly addresses its core requirements (e.g., specific question, focus, theme, audience). **This is the most important criterion.**
* **Accuracy (Factual & Visual)**: Information presented related to the text prompt's subject is factually correct. **When visual details from the image are referenced** to support the explanation, those descriptions must accurately reflect the supplementary image.
* **Relevant Visual Support**: When the explanation references the supplementary image, the visual details mentioned are **directly relevant to the points being made in response to the text prompt**. The response uses the image effectively as support where appropriate, without letting tangential visual details derail the main focus set by the text prompt. Assesses if the image *enhances* the answer to the text prompt.
* **Analytical Depth (Text-Focused)**: The response offers insights (e.g., into technique, symbolism, context, intent) that are relevant to the **focus defined by the text prompt**. Visual details from the image may be used effectively as evidence or illustration for this analysis.
* **Clarity and Cohesion**: The explanation is well-organized, uses clear language appropriate for the audience specified or implied in the text prompt, and flows logically, focusing on answering the text prompt.
* **Contextualization (Prompt-Driven)**: Provides relevant context (artist, period, etc.) that aids in understanding the topic raised in the **text prompt**. May connect this context to the image if relevant to the explanation.

## Rating Rubric
* **5 (Very Good)**: Perfectly addresses the **text prompt** with accuracy, depth, and clarity. **Skillfully and relevantly integrates details from the supplementary image** to effectively support or illustrate the points made. Excels in all criteria, demonstrating mastery of the text task enhanced by the image.
* **4 (Good)**: Strongly addresses the **text prompt** accurately and clearly. **Uses the image appropriately and accurately** for relevant support, though perhaps less insightfully or seamlessly than a 5. A solid response primarily focused on the text prompt.
* **3 (Ok)**: Acceptably addresses the main points of the **text prompt**, but may be somewhat superficial, lack depth, or have minor clarity issues. **Uses the image, but potentially in a limited, slightly irrelevant, or non-insightful way**, or with minor visual inaccuracies. Gets the basics of the text prompt right.
* **2 (Bad)**: **Significant issues addressing the text prompt** (major inaccuracies, instruction failure, lack of focus) OR **significantly misuses, misinterprets, or irrelevantly focuses on the image** instead of using it as support for the text prompt. Fails on key aspects of the primary task.
* **1 (Very Bad)**: Fundamentally fails to address the **text prompt** OR relies on major factual/visual errors or hallucinations related to the core task. **Ignores the text prompt's focus in favor of the image, or completely misuses the image** context.

## Evaluation Steps
* STEP 1: Carefully read the **primary text prompt** to fully understand the core request and its constraints.
* STEP 2: Examine the **supplementary image** to understand the visual context provided.
* STEP 3: Read the AI-generated response.
* STEP 4: Assess how well the response addresses the **text prompt** based on Instruction Following, Accuracy (factual), Analytical Depth, Contextualization, and Clarity.
* STEP 5: Evaluate **how effectively and accurately the supplementary image was used** based on Relevant Visual Support and Accuracy (visual). Consider if the image use enhanced the answer to the text prompt or distracted from it.
* STEP 6: Determine the overall quality, **giving primary weight to how well the text prompt was addressed (Step 4)**, while factoring in the appropriate use of the image (Step 5). Assign a rating from 1 to 5 based on the rubric, justifying your score.

# User Inputs (Text Primary, Image Supplementary) and AI-generated Response
## User Inputs

### Primary Text Prompt
{prompt}

### Supplementary Image Input
The evaluator will be shown or given access to the corresponding image input separately

## AI-generated Response
{response}
"""

In [88]:
# Define a structured enum class to capture the result.
class ArtworkRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'

def eval_artwork_with_image(prompt, ai_response, image_url):
    """
    Evaluate an artwork explanation that was generated with image input.
    
    Args:
        prompt (str): The original prompt used to generate the response
        ai_response (str): The AI-generated response to evaluate
        image_url (str): URL of the artwork image
        
    Returns:
        tuple: (verbose_eval, structured_eval)
            - verbose_eval (str): Detailed evaluation text
            - structured_eval (ArtworkRating): Structured rating enum value
    """
    try:
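        # Get image data from URL; fail early on HTTP errors so an error page
        # is never sent to the evaluator as if it were image bytes
        image = requests.get(image_url, timeout=30)
        image.raise_for_status()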
        
        # Generate the full text evaluation with both text and image
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                ARTWORK_EXPLANATION_PROMPT_TEXT_PRIMARY.format(
                    prompt=prompt,
                    response=ai_response
                ),
                types.Part.from_bytes(data=image.content, mime_type="image/jpeg")
            ]
        )
        verbose_eval = response.text

        # Get structured rating
        structured_output_config = types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=ArtworkRating,
        )
        rating_response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents="Extract Rating score from the evaluation text: " + verbose_eval,
            config=structured_output_config,
        )
        structured_eval = rating_response.parsed

        return verbose_eval, structured_eval
        
    except Exception as e:
        return f"Error during evaluation: {str(e)}", None
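
Because the enum values are the strings `'1'` to `'5'`, a rating can be converted to a number for aggregation. Once results are written to CSV, the rating column would typically hold text such as `ArtworkRating.GOOD` rather than an enum member, so reading the file back requires parsing by member name. A minimal sketch, assuming a hypothetical helper `rating_to_score`; the enum is repeated here only to keep the block self-contained:

```python
import enum


class ArtworkRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def rating_to_score(rating):
    """Hypothetical helper: map a rating (enum or its text form) to an int, or None."""
    if isinstance(rating, ArtworkRating):
        return int(rating.value)
    if isinstance(rating, str):
        # Accept "ArtworkRating.GOOD", "GOOD", or a bare digit such as "4"
        name = rating.strip().split(".")[-1]
        if name in ArtworkRating.__members__:
            return int(ArtworkRating[name].value)
        if name in {member.value for member in ArtworkRating}:
            return int(name)
    # Missing or unparseable ratings (e.g. None after a failed evaluation)
    return None
```

Returning `None` for missing ratings keeps failed evaluations out of any average instead of silently counting them as a score.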

In [90]:
# Evaluate all artwork explanations generated with image input.

# Create a copy of image_generations to store evaluation results
evaluation_results_df = image_generations.copy()

# Add new columns for evaluation results if they don't exist
if 'evaluation_text' not in evaluation_results_df.columns:
    evaluation_results_df['evaluation_text'] = None
if 'rating' not in evaluation_results_df.columns:
    evaluation_results_df['rating'] = None

print("\nStarting evaluation process...")

# Process each row
for i, row in tqdm(evaluation_results_df.iterrows(), total=len(evaluation_results_df)):
    # Skip if already evaluated
    if pd.notna(row['evaluation_text']):
        continue
        
    try:
        # Add delay after every 8 calls to respect rate limits
        if i > 0 and i % 8 == 0:
            print(f"\nWaiting for 1 minute after {i} calls...")
            time.sleep(60)  # Wait for 60 seconds
            print("Resuming evaluation...")
        
        # Generate evaluation using the existing function
        verbose_eval, structured_eval = eval_artwork_with_image(
            prompt=row['prompt'],
            ai_response=row['response'],
            image_url=row['image_url']
        )
        
        # Store results
        evaluation_results_df.at[i, 'evaluation_text'] = verbose_eval
        evaluation_results_df.at[i, 'rating'] = structured_eval
        
        # Save progress after each evaluation
        evaluation_results_df.to_csv('results/image_input/evaluation_results_with_image.csv', index=False)
        
        print(f"Evaluated artwork {i + 1}/{len(evaluation_results_df)}")
        
    except Exception as e:
        print(f"Error processing artwork {i + 1}: {e}")
        evaluation_results_df.at[i, 'evaluation_text'] = f"Error: {e}"
        evaluation_results_df.at[i, 'rating'] = None
        # Save progress even when error occurs
        evaluation_results_df.to_csv('results/image_input/evaluation_results_with_image.csv', index=False)

print("\nEvaluation process completed!")

# Display summary statistics of ratings
print("\nRating distribution:")
print(evaluation_results_df['rating'].value_counts())

# Display first result as example
print("\nExample evaluation (first row):")
print("Rating:", evaluation_results_df['rating'].iloc[0])
print("\nEvaluation text:")
print(evaluation_results_df['evaluation_text'].iloc[0])


Starting evaluation process...


  2%|▏         | 1/50 [00:04<04:04,  4.99s/it]

Evaluated artwork 1/50


  4%|▍         | 2/50 [00:10<04:07,  5.16s/it]

Evaluated artwork 2/50


  6%|▌         | 3/50 [00:15<03:55,  5.00s/it]

Evaluated artwork 3/50


  8%|▊         | 4/50 [00:20<04:00,  5.22s/it]

Evaluated artwork 4/50


 10%|█         | 5/50 [00:26<04:11,  5.58s/it]

Evaluated artwork 5/50


 12%|█▏        | 6/50 [00:32<04:05,  5.59s/it]

Evaluated artwork 6/50


 14%|█▍        | 7/50 [00:37<03:58,  5.54s/it]

Evaluated artwork 7/50


 16%|█▌        | 8/50 [00:43<03:47,  5.41s/it]

Evaluated artwork 8/50

Waiting for 1 minute after 8 calls...
Resuming evaluation...


 18%|█▊        | 9/50 [01:48<16:32, 24.21s/it]

Evaluated artwork 9/50


 20%|██        | 10/50 [01:53<12:13, 18.35s/it]

Evaluated artwork 10/50


 22%|██▏       | 11/50 [01:58<09:12, 14.17s/it]

Evaluated artwork 11/50


 24%|██▍       | 12/50 [02:03<07:17, 11.53s/it]

Evaluated artwork 12/50


 26%|██▌       | 13/50 [02:08<05:48,  9.41s/it]

Evaluated artwork 13/50


 28%|██▊       | 14/50 [02:13<04:54,  8.18s/it]

Evaluated artwork 14/50


 30%|███       | 15/50 [02:20<04:33,  7.81s/it]

Evaluated artwork 15/50


 32%|███▏      | 16/50 [02:24<03:44,  6.60s/it]

Evaluated artwork 16/50

Waiting for 1 minute after 16 calls...
Resuming evaluation...


 34%|███▍      | 17/50 [03:29<13:17, 24.17s/it]

Evaluated artwork 17/50


 36%|███▌      | 18/50 [03:34<09:43, 18.25s/it]

Evaluated artwork 18/50


 38%|███▊      | 19/50 [03:39<07:22, 14.28s/it]

Evaluated artwork 19/50


 40%|████      | 20/50 [03:45<05:52, 11.75s/it]

Evaluated artwork 20/50


 42%|████▏     | 21/50 [03:50<04:47,  9.92s/it]

Evaluated artwork 21/50


 44%|████▍     | 22/50 [03:55<03:54,  8.36s/it]

Evaluated artwork 22/50


 46%|████▌     | 23/50 [04:02<03:34,  7.96s/it]

Evaluated artwork 23/50


 48%|████▊     | 24/50 [04:08<03:13,  7.44s/it]

Evaluated artwork 24/50

Waiting for 1 minute after 24 calls...
Resuming evaluation...


 50%|█████     | 25/50 [05:13<10:19, 24.77s/it]

Evaluated artwork 25/50


 52%|█████▏    | 26/50 [05:18<07:30, 18.77s/it]

Evaluated artwork 26/50


 54%|█████▍    | 27/50 [05:23<05:37, 14.68s/it]

Evaluated artwork 27/50


 56%|█████▌    | 28/50 [05:26<04:07, 11.24s/it]

Evaluated artwork 28/50


 58%|█████▊    | 29/50 [05:33<03:27,  9.88s/it]

Evaluated artwork 29/50


 60%|██████    | 30/50 [05:38<02:46,  8.31s/it]

Evaluated artwork 30/50


 62%|██████▏   | 31/50 [05:42<02:14,  7.10s/it]

Evaluated artwork 31/50


 64%|██████▍   | 32/50 [05:48<02:01,  6.74s/it]

Evaluated artwork 32/50

Waiting for 1 minute after 32 calls...
Resuming evaluation...


 66%|██████▌   | 33/50 [06:53<06:51, 24.21s/it]

Evaluated artwork 33/50


 68%|██████▊   | 34/50 [06:57<04:49, 18.08s/it]

Evaluated artwork 34/50


 70%|███████   | 35/50 [07:02<03:35, 14.39s/it]

Evaluated artwork 35/50


 72%|███████▏  | 36/50 [07:07<02:41, 11.52s/it]

Evaluated artwork 36/50


 74%|███████▍  | 37/50 [07:13<02:05,  9.67s/it]

Evaluated artwork 37/50


 76%|███████▌  | 38/50 [07:18<01:39,  8.30s/it]

Evaluated artwork 38/50


 78%|███████▊  | 39/50 [07:23<01:20,  7.34s/it]

Evaluated artwork 39/50


 80%|████████  | 40/50 [07:27<01:04,  6.46s/it]

Evaluated artwork 40/50

Waiting for 1 minute after 40 calls...
Resuming evaluation...


 82%|████████▏ | 41/50 [08:33<03:36, 24.08s/it]

Evaluated artwork 41/50


 84%|████████▍ | 42/50 [08:38<02:27, 18.49s/it]

Evaluated artwork 42/50


 86%|████████▌ | 43/50 [08:43<01:40, 14.34s/it]

Evaluated artwork 43/50


 88%|████████▊ | 44/50 [08:47<01:08, 11.49s/it]

Evaluated artwork 44/50


 90%|█████████ | 45/50 [08:53<00:48,  9.74s/it]

Evaluated artwork 45/50


 92%|█████████▏| 46/50 [08:59<00:34,  8.72s/it]

Evaluated artwork 46/50


 94%|█████████▍| 47/50 [09:04<00:22,  7.60s/it]

Evaluated artwork 47/50


 96%|█████████▌| 48/50 [09:09<00:13,  6.75s/it]

Evaluated artwork 48/50

Waiting for 1 minute after 48 calls...
Resuming evaluation...


 98%|█████████▊| 49/50 [10:15<00:24, 24.52s/it]

Evaluated artwork 49/50


100%|██████████| 50/50 [10:18<00:00, 12.37s/it]

Evaluated artwork 50/50

Evaluation process completed!

Rating distribution:
rating
ArtworkRating.GOOD         36
ArtworkRating.VERY_GOOD    13
Name: count, dtype: int64

Example evaluation (first row):
Rating: ArtworkRating.GOOD

Evaluation text:
## Evaluation

### STEP 1: Carefully read the primary text prompt to fully understand the core request and its constraints.
The prompt requests that I act as a tour guide in an art gallery and explain an artwork to a visitor. I must use the text details as the main source of information and use the image for visual context. The explanation should include the artist's background, the artwork's subject matter, the historical period and significance, the materials and techniques used, and its size and location. The core facts should come from the text, but the description can be enriched with observations from the image.

### STEP 2: Examine the supplementary image to understand the visual context provided.
The image depicts a scene of violence 




In [91]:
# Display summary statistics of ratings
print("\nRating distribution:")
print(evaluation_results_df['rating'].value_counts())



Rating distribution:
rating
ArtworkRating.GOOD         36
ArtworkRating.VERY_GOOD    13
Name: count, dtype: int64


**Observations:**
- Overall generation quality is good in both conditions, with the text-only baseline rated slightly higher:
    - Text+image: Good (36), Very_Good (13); one of the 50 rows has no rating and is excluded from the counts
    - Text-only: Good (31), Very_Good (19)
  
- The gap may stem from the different prompts used for generation and evaluation in each condition rather than from the image input itself. LLM-based evaluation is sensitive to prompt wording, so ratings produced with different evaluation prompts are not directly comparable.
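
To put the two distributions on one scale, the counts above can be converted to mean scores. A minimal sketch using only the counts reported in this notebook (Good = 4, Very_Good = 5); the helper name `mean_rating` is illustrative:

```python
def mean_rating(counts):
    """Return the mean score for a {score: count} mapping."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


# Counts taken from the rating distributions reported above
text_image = {4: 36, 5: 13}  # 49 rated rows
text_only = {4: 31, 5: 19}   # 50 rated rows

print(f"Text+image mean: {mean_rating(text_image):.2f}")
print(f"Text-only mean:  {mean_rating(text_only):.2f}")
```

This works out to roughly 4.27 for text+image versus 4.38 for text-only, a small gap for a sample of about 50 artworks per condition.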


# Part 5: Future Work

**1. Enhanced Evaluation Framework**
- Implement multi-LLM evaluation using GPT-4, Claude, and DeepSeek
- Create ensemble evaluation by bagging results from multiple LLMs
- Add confidence scores for evaluations

**2. Response Generation Improvements**
- Experiment with different length control parameters (50-1000 tokens)
- Test various temperature and top-p settings
- Implement dynamic length adjustment based on artwork complexity

**3. External Knowledge Integration**
- Add Google Search API for fact verification
- Implement RAG with art history knowledge base

**4. Interactive Features**
- Add follow-up question handling
- Implement context-aware response generation
- Create personalized explanation paths
- Add multilingual support