## Finetune CLIP on archaeological objects

October 8. Shawn Graham

This notebook downloads images from Open Context, reshapes the metadata into captions, and then uses Damian Stewart's `huggingface_finetune_clip.py` to fine-tune the `openai/clip-vit-base-patch32` model (see [this](https://github.com/damian0815/finetune-clip-huggingface/blob/main/huggingface_finetune_clip_runner.ipynb)). Other CLIP versions can be used, but so far the others I've tried need too much memory for the free Colab tier or for my M1 Mac mini with 16 GB of RAM.

The first section below builds captions from separate metadata fields, downloads the images, and reshapes the results. For the Open Context materials, use the subsequent block ('better captions') instead. A later section grabs materials from the Met and appends them to the training data for the fine-tuning; that section can be ignored if desired.

In [1]:
!pip install pandas requests

Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/ff/5a/c7359edec58500b35da8dc40a69ea7b0a3be48a479e1c91e8e8d0a2d9aa7/pandas-2.1.1-cp310-cp310-macosx_11_0_arm64.whl.metadata
  Downloading pandas-2.1.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting numpy>=1.22.4 (from pandas)
  Obtaining dependency information for numpy>=1.22.4 from https://files.pythonhosted.org/packages/5c/ff/0e1f31c70495df6a1afbe98fa237f36e6fb7c5443fcb9a53f43170e5814c/numpy-1.26.0-cp310-cp310-macosx_11_0_arm64.whl.metadata
  Downloading numpy-1.26.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (53 kB)
Collecting pytz>=2020.1 (from pandas)
  Obtaining dependency information for pytz>=2020.1 from https://files.pythonhosted.org/packages/32/4d/aaf7eff5deb402fd9a24a1449a8119f00d74ae9c2efa79f8ef9994261fc2/pytz-2023.3

## full json export from open context

If the data is made available as JSON with just the raw values in the various fields, some shenanigans are necessary to get it into shape, with the media UUID as the image name and captions derived from those fields.

If, on the other hand, the data is made available with the image file URI, the media UUID, and captions in which some of the CIDOC CRM ontology fields have been used to craft more literate captions, then skip to the next section.
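
To make the distinction concrete, here is a minimal sketch that classifies a single record by the keys it carries. The helper name `export_kind`, its return labels, and the placeholder URLs are illustrative assumptions, not part of Open Context's tooling; the key names are the ones that appear in the two exports used in this notebook.

```python
def export_kind(record):
    """Guess which Open Context export a record came from (hypothetical helper)."""
    # The sentence-caption export carries a ready-made caption and a media UUID.
    if record.get("caption") and record.get("media__uuid"):
        return "sentence_captions"
    # Otherwise assume the field-based export, where captions must be assembled.
    return "field_metadata"


# Placeholder records for illustration only (not real Open Context data).
field_record = {
    "image_file__uri": "https://example.org/a.jpg",
    "media__uri": "https://opencontext.org/media/example-uuid-1",
    "consists_of": "chert",
}
sentence_record = {
    "image_file__uri": "https://example.org/b.jpg",
    "media__uuid": "example-uuid-2",
    "caption": "An image of an archaeological artifact ...",
}

print(export_kind(field_record), export_kind(sentence_record))
```

Records of the first kind go through the section below; records of the second kind can go straight to the 'better captions' section.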

In [12]:
import requests
import pandas as pd

url = 'https://raw.githubusercontent.com/opencontext/archaeology-images-ai/main/json_data/artifact_images_w_descriptions.json'
data = requests.get(url).json()
df = pd.json_normalize(data)  # convert json to pandas DataFrame

In [13]:
df.rename(columns={'image_file__uri': 'image'}, inplace=True)
df

Unnamed: 0,image,media__uri,image_genre,image_type,subject__item_class__label,context___1,context___2,context___3,time_range,has_type,consists_of,origin_place,has_taxonomic_identifier,has_anatomical_identification,temporal_coverage,project_specific_descriptions
0,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/a9cedbad-e25b-4f...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,seals (artifacts),rock (inorganic material),,,,,"Artifact Name: Stamp Seal \n Material: Stone, ..."
1,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/1bbbca07-82f3-46...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,seals (artifacts),soapstone,,,,,Artifact Name: Stamp Seal \n Material: Steatit...
2,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/2062e3fa-41e2-d7...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,seals (artifacts),rock (inorganic material),,,,,"Artifact Name: Stamp Seal \n Material: Stone, ..."
3,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/2dc18114-4ddf-7c...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,pendants (jewelry),chert,,,,,Artifact Name: Pendant \n Material: Chert/Flint
4,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/d7e8b4e5-be3b-44...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,nails (fasteners),iron (metal),,,,,Artifact Name: Nail \n Material: Iron \n Dispo...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72357,https://artiraq.org/static/opencontext/pettegr...,https://opencontext.org/media/d3620d27-cb41-44...,archaeology,artifact,Object,Europe,Greece,Corinthia,,,,,,,,"Chronotype: Fineware, Late Helladic I-IIA \n Z..."
72358,https://artiraq.org/static/opencontext/pettegr...,https://opencontext.org/media/41f03708-baa5-4d...,archaeology,artifact,Object,Europe,Greece,Corinthia,,,,,,,,"Chronotype: Fineware, Late Helladic I-IIA \n Z..."
72359,https://artiraq.org/static/opencontext/pettegr...,https://opencontext.org/media/453e04b2-7905-4e...,archaeology,artifact,Object,Europe,Greece,Corinthia,,,,,,,,"Chronotype: Fineware, Late Helladic I-IIA \n Z..."
72360,https://artiraq.org/static/opencontext/interna...,https://opencontext.org/media/1840719d-2934-48...,archaeology,artifact,Object,Off World,International Space Station,Zvezda Service Module,,icons (devotional images); religions and reli...,,,,,,Location: Top center \n Item type: Icon \n Sec...


In [14]:
import os
import random
import requests

# Shuffle your DataFrame
df = df.sample(frac=1, random_state=76) #67 for the first 500; will a different seed give me different pics & I can collate results?

# Split the data into training and testing datasets
train_images = df.iloc[:1000]
test_images = df.iloc[1000:1100]

def download_images(df, dir_name):
    os.makedirs(dir_name, exist_ok=True)
    url_errors=[]
    for _, row in df.iterrows():
        url = row['image']
        # Get the extension for the image file from the URL
        extension = url.split('.')[-1]
        # Get the UUID for the media file
        media_uuid = row['media__uri'].split('/')[-1]
        file_name = f'{media_uuid}.{extension}'
        file_path = os.path.join(dir_name, file_name)

        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
        except (requests.exceptions.RequestException, requests.exceptions.Timeout):
            print(f'An error occurred while fetching: {url}')
            url_errors.append(url)
            continue

        with open(file_path, 'wb') as img_file:
            img_file.write(response.content)
    
    print(f"Downloaded images in directory '{dir_name}'. Number of url errors: {len(url_errors)}")

download_images(train_images, 'images') # downloads training images to 'images' folder
download_images(test_images, 'testing') # downloads testing images to 'testing' folder

An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-22-d-9-11-4-4-p-1jpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-24-19710499bwjpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-22-c-1-1093-25-4-p-2jpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-24-19720119chgjpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-24-20000037sideajpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-22-f-1-1021-1118-1-p-2jpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-22-d-6-102-2-19-p-6jpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-24-88-125bottomjpg/full/675,/0/default.jpg
An error
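
On the question in the shuffle comment above (whether a different seed gives different pictures that can be collated): a rough sketch on a toy DataFrame suggests it does, as long as duplicates are dropped when the draws are combined. The toy data and the helper `sample_with_seed` are assumptions for illustration; only `df.sample(frac=1, random_state=...)` comes from the cell above.

```python
import pandas as pd

# Toy stand-in for the Open Context DataFrame (made-up media URIs).
toy = pd.DataFrame(
    {"media__uri": [f"https://opencontext.org/media/example-{i}" for i in range(20)]}
)


def sample_with_seed(df, n, seed):
    """Shuffle with a given seed and keep the first n rows, as in the cell above."""
    return df.sample(frac=1, random_state=seed).iloc[:n]


first = sample_with_seed(toy, 5, seed=67)
second = sample_with_seed(toy, 5, seed=76)

# Collate both draws; a record picked by both seeds is kept only once.
combined = pd.concat([first, second]).drop_duplicates(subset="media__uri")
print(len(first), len(second), len(combined))
```

The same seed always reproduces the same draw, so recording the seeds used is enough to rebuild a collated set later.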

In [15]:
import re
import string
def create_metadata(df, dir_name):
    # Make a copy of the dataframe to avoid modifying the original one
    df = df.copy()

    # fill NaN values with empty string
    df.fillna('', inplace=True)

    # append values of required columns into new 'caption' column
    df['caption'] = 'A photograph of ' + df['consists_of'].astype(str) \
                    + ', ' + df['project_specific_descriptions'].astype(str) \
                    + ' dating to ' + df['time_range'].astype(str) \
                    + ' from ' + df['context___1'].astype(str) \
                    + ', ' + df['context___2'].astype(str) \
                    + ', ' + df['context___3'].astype(str)
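    # Series.replace only matches whole cell values, so use .str.replace for substrings
    df['caption'] = df['caption'].str.replace('\n', ' ', regex=False)
    df['caption'] = df['caption'].str.replace('False', ' ', regex=False)
    df['caption'] = df['caption'].str.replace('True', ' ', regex=False)
    # Replace all punctuation with spaces (re.escape keeps the character class well-formed)
    df['caption'] = df['caption'].apply(lambda x: re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', x)).str.strip()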

    # Rewrite 'image' column to just contain the filename
    df['image'] = df.apply(lambda row: f"{row['media__uri'].split('/')[-1]}.{row['image'].split('.').pop()}", axis=1)

    # reshaping data to contain only 'image' and 'caption'
    df = df[['image', 'caption']]
    df.loc[:, 'image'] = dir_name + '/' + df['image'].astype(str)

    return df

df = create_metadata(train_images, 'images')
with open('train.json', 'w') as file:
    df.to_json(file, orient='records', lines=True)

testdf = create_metadata(test_images, 'testing')
with open('test.json', 'w') as file:
    testdf.to_json(file, orient='records', lines=True)
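
The caption clean-up can also be sketched on a single string using only the standard library, which makes it easy to check what a caption will look like before training. `clean_caption` is a hypothetical helper; unlike the cell above it also collapses runs of whitespace, which is an addition rather than something the original code does. The sample string is assembled from values shown in the DataFrame preview.

```python
import re
import string


def clean_caption(text):
    """Flatten newlines, replace punctuation with spaces, collapse whitespace."""
    text = text.replace("\n", " ")
    # re.escape guards characters such as ']' and '\' inside the character class.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Extra step (not in the notebook cell): squeeze repeated spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "A photograph of chert, Artifact Name: Pendant \n Material: Chert/Flint"
    " dating to 6500 BCE to 5500 BCE from Asia, Turkey, Domuztepe"
)
print(clean_caption(raw))
```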

## Data from Open Context with better captions

If the export data looks like this:

```json
{
        "image_file__uri":"https:\/\/iiif.archivelab.org\/iiif\/opencontext-14-b2009121-5jpg\/full\/675,\/0\/default.jpg",
        "media__uuid":"6fb98a27-a3a9-465e-6804-84e09f23b704",
        "media__uri":"https:\/\/opencontext.org\/media\/6fb98a27-a3a9-465e-6804-84e09f23b704",
        "caption":"An image of an archaeological artifact found at Tell en-Nasbeh, a place in Palestinian Authority which is more generally located in Asia. The artifact mainly consists of ceramic (material). Additional attributes that describe the artifact include: Catalog: Object \n Catalog Type: Archaeology \n Collection: Tell en-Nasbeh Collection \n Cultural Period: Iron IIC \n Conservation Condition: Good \n Stratum: 3C-3A \n Primary Location: Holbrook Hall, Pacific School of Religion \n Object Name: Loom Weight \n Object Sub-category: Ceramic -- Tool \n Subjects and Themes: Weaving and textiles \n Category Type: Ceramic \n Excav. Ceramic Type: M15 \n Photo Number: 254 \n Material: Ceramic \n Decoration: None \n Completeness: Whole \n Manufacture: Handmade"
    }
```
then use the following code snippet to download images and create the training/testing json datasets.

In [25]:
import os
import json
import requests
import pandas as pd
from urllib.parse import urlparse
from urllib.request import urlretrieve
from urllib.error import HTTPError, URLError
from sklearn.model_selection import train_test_split
import concurrent.futures

# Load JSON from remote URL
url = "https://raw.githubusercontent.com/opencontext/archaeology-images-ai/main/json_data/artifact_images_w_sentence_captions.json"
response = requests.get(url)
data = response.json()

# Randomly select records
df = pd.DataFrame(data)
train_df, rem_df = train_test_split(df, train_size=5000, random_state=42)
test_df = rem_df.sample(500, random_state=42)

def download_and_rename(row, folder):
    os.makedirs(folder, exist_ok=True)
    uri = row['image_file__uri']
    # Check if uri exists and is a string
    if uri and isinstance(uri, str):
        uuid = row['media__uuid']
        caption = row['caption']
        parse_object = urlparse(uri)
        _, ext = os.path.splitext(parse_object.path)
        # Make sure uuid and ext are strings
        if not isinstance(uuid, str):
            uuid = str(uuid)
        if isinstance(ext, bytes):
            ext = ext.decode("utf-8") 
        new_image_name = uuid + ext
        new_image_path = os.path.join(folder, new_image_name)

        try:
            urlretrieve(uri, new_image_path)
            return {"image": new_image_path, "caption": caption}

        except (HTTPError, URLError) as error:
            print(f"Download error for URL {uri}")
            print(error)
            return None
    else:
        return None

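# Writing to 'jsonl' files
def write_to_jsonl(new_data, jsonl_file):
    with open(jsonl_file, 'w') as file:
        for json_dict in new_data:
            # Skip records whose download failed (download_and_rename returned None)
            if json_dict is None:
                continue
            line = json.dumps(json_dict)
            file.write(line + "\n")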

# Process train and test data
with concurrent.futures.ThreadPoolExecutor() as executor:
    train_data = list(executor.map(download_and_rename, [row for _, row in train_df.iterrows()], ['images']*len(train_df)))
    test_data = list(executor.map(download_and_rename, [row for _, row in test_df.iterrows()], ['testing']*len(test_df)))

# Write train/test data to jsonl files
write_to_jsonl(train_data, 'train.json')
write_to_jsonl(test_data, 'test.json')

Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/19820174BACK.jpg
HTTP Error 404: Not Found
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/20020015HEAD.jpg
HTTP Error 404: Not Found
Download error for URL https://iiif.archivelab.org/iiif/opencontext-16-250jpg/full/675,/0/default.jpg
HTTP Error 404: NOT FOUND
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/20020004PROFILE.jpg
HTTP Error 404: Not Found
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/20030122FRONT.jpg
HTTP Error 404: Not Found
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/20020028BOTTOM.jpg
HTTP Error 404: Not Found
Download error for URL https://iiif.archivelab.org/iiif/opencontext-16-258jpg/full/675,/0/default.jpg
HTTP Error 404: NOT FOUND
Download error for URL https://iiif.archivelab.org/iiif/op
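
Because `download_and_rename` returns `None` when a download fails, it can be worth checking the JSON Lines files before training. The following is a minimal sketch, assuming each line should be a JSON object with an `image` path that exists on disk; the helper name `clean_jsonl` and the output filename in the usage note are hypothetical.

```python
import json
import os


def clean_jsonl(in_path, out_path):
    """Copy only records that are JSON objects pointing at an existing image file."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            image = record.get("image") if isinstance(record, dict) else None
            if isinstance(image, str) and os.path.isfile(image):
                dst.write(json.dumps(record) + "\n")
                kept += 1
            else:
                # Covers 'null' lines from failed downloads and missing files.
                dropped += 1
    return kept, dropped
```

For example, `clean_jsonl('train.json', 'train_clean.json')` would return the number of kept and dropped records; the training step would then need to point at the cleaned file.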

## Add Other Sources of Imagery to Complement?

Let's try the Met.

In [None]:
!pip install jsonlines
!pip install retry

In [16]:
import requests
import json
import jsonlines
import random
from concurrent.futures import ThreadPoolExecutor

# Function to fetch object data
def fetch_object_data(object_id):
    object_response = requests.get(f"{base_url}objects/{object_id}")
    return object_response.json()

# Define base URL for the Met's API
base_url = 'https://collectionapi.metmuseum.org/public/collection/v1/'

# Define our search term
search_term = 'archaeology'

allowed_departments = ["Ancient Near Eastern Art", "Egyptian Art", "Greek and Roman Art"]

# Generate the search URL
search_url = f"{base_url}search?q={search_term}"

# Make the GET request to the Met's API search endpoint
response = requests.get(search_url)

# Parse the response as JSON
data = response.json()

# Get a random sample of up to 1000 object IDs.
# If there are fewer than 1000, get all of them.
object_ids_sample = random.sample(data['objectIDs'], min(1000, len(data['objectIDs'])))

# Open the jsonlines file in write mode
with jsonlines.open('METoutput.json', mode='w') as writer:
    # Use a ThreadPoolExecutor for parallel requests
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Fetch all object data in parallel
        for object_data in executor.map(fetch_object_data, object_ids_sample):
            # Keep the object only if its department is allowed and it has an image;
            # .get() avoids a KeyError when the API returns an error payload
            if (object_data.get('department') in allowed_departments) and object_data.get('primaryImage'):
                # Create a list with all components of the caption
                caption_components = [
                    object_data.get('title') or None,
                    f"a {object_data['objectName']}" if object_data.get('objectName') else None,
                    f"from the {object_data['culture']}" if object_data.get('culture') else None,
                    f"dating to the {object_data['period']}" if object_data.get('period') else None,
                    object_data['dynasty'] if object_data.get('dynasty') else None,
                    object_data['reign'] if object_data.get('reign') else None,
                    f"({object_data['objectDate']})" if object_data.get('objectDate') else None,
                    f"created by {object_data['artistDisplayName']}" if object_data.get('artistDisplayName') else None,
                    f"in {object_data['country']}" if object_data.get('country') else None,
                    object_data['region'] if object_data.get('region') else None
                ]

                # Remove None elements from the list
                caption_components = [component for component in caption_components if component is not None]

                # Create the caption
                caption = ', '.join(caption_components) + '.'

                # Create the record
                record = {
                    'image': object_data['primaryImage'],
                    'caption': caption
                }

                # Write to jsonlines file
                writer.write(record)
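
The caption assembly in the cell above can be pulled out into a small function and tried on a single record without calling the API. This is a sketch covering only a subset of the fields; `build_met_caption` is a hypothetical name and the sample record is invented for illustration, not real Met data.

```python
def build_met_caption(obj):
    """Join the non-empty descriptive fields of a Met object record into a caption."""
    parts = [
        obj.get("title"),
        f"a {obj['objectName']}" if obj.get("objectName") else None,
        f"from the {obj['culture']}" if obj.get("culture") else None,
        f"dating to the {obj['period']}" if obj.get("period") else None,
        f"({obj['objectDate']})" if obj.get("objectDate") else None,
    ]
    # Empty strings and None are both dropped, so missing fields leave no stray commas.
    return ", ".join(p for p in parts if p) + "."


# Invented record using the field names from the cell above.
sample = {
    "title": "Scarab",
    "objectName": "Scarab",
    "culture": "",
    "period": "New Kingdom",
    "objectDate": "ca. 1550-1295 B.C.",
}
print(build_met_caption(sample))
# Scarab, a Scarab, dating to the New Kingdom, (ca. 1550-1295 B.C.).
```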

In [17]:
import os
import json
import jsonlines
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from sklearn.model_selection import train_test_split
import requests
from retry import retry

# Function to download images
@retry(tries=3, delay=2)
def download_image(image_url, local_path):
    try:
        response = requests.get(image_url, stream=True)
        response.raise_for_status()
        with open(local_path, 'wb') as f:
            f.write(response.content)
    except requests.exceptions.RequestException as err:
        print ("Requests Error-URL {0}: {1}".format(image_url,str(err)))
        raise Exception(err)

def process_lines(lines, dataset):
    # List to store records
    records = []

    # Create the directory if it doesn't exist
    os.makedirs(dataset, exist_ok=True)

    for line in lines:
        # Parse the line as JSON
        data = json.loads(line)

        # Define the local path
        image_url = data['image']
        local_filename = image_url.split('/')[-1]  # Use the last part of the URL as the filename
        local_path = os.path.join(dataset, local_filename)

        # Append this task to the list
        records.append((image_url, local_path, data['caption']))

    # Create ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Download images in parallel
        executor.map(lambda x: download_image(x[0], x[1]), records)

    # Open the corresponding jsonl file in write mode
    with jsonlines.open(f'{dataset}_output.json', mode='w') as writer:
        # Write records to file, skipping any image that failed to download
        for _, local_path, caption in records:
            if not os.path.exists(local_path):
                continue
            record = {
                'image': local_path,
                'caption': caption
            }
            writer.write(record)

# Read lines from the METoutput.json file (JSON Lines format)
with open('METoutput.json', 'r') as f:
    lines = f.readlines()

# Split into train and test sets
train_lines, test_lines = train_test_split(lines, test_size=0.20)

# Process training and test sets
process_lines(train_lines, 'METtrain')
process_lines(test_lines, 'METtest')

In [18]:
# Open 'train.json' in append mode and 'METtrain_output.json' in read mode
with open('train.json', 'a') as train_file, open('METtrain_output.json', 'r') as met_file:
    # Iterate over the lines in met_file
    for line in met_file:
        # Write each line to train_file
        train_file.write(line)
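
One thing to watch with append mode: running the cell above twice adds the Met lines to `train.json` twice. A possible alternative, sketched here under the assumption that exact duplicate lines are never wanted, is to merge into a new file and write each distinct line once. The helper `merge_jsonl` and the output filename in the usage note are hypothetical.

```python
def merge_jsonl(paths, out_path):
    """Concatenate JSON Lines files, writing each distinct line only once."""
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                for line in src:
                    line = line.strip()
                    # Skip blank lines and lines already written.
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
                        written += 1
    return written
```

For example, `merge_jsonl(['train.json', 'METtrain_output.json'], 'train_combined.json')` can be re-run safely; the training file setting would then need to point at the combined file.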

## Fine-tune CLIP

In [None]:
!pip install torchvision datasets Pillow
!pip install -q git+https://github.com/huggingface/transformers
!pip install accelerate -U

In [19]:
# test loading it back in
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.json")
print(f"first image: {dataset['train'][0]['image']}, caption: '{dataset['train'][0]['caption']}'")


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

first image: images/5118cd46-6af6-4e71-8043-fa73020a418e.JPG, caption: 'An image of an archaeological artifact found at PKAP Survey Area, a place in Cyprus which is more generally located in Europe. Additional attributes that describe the artifact include: Chronotype: Phocaean Ware 10'


In [5]:
!git clone https://github.com/damian0815/finetune-clip-huggingface.git

Cloning into 'finetune-clip-huggingface'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 19 (delta 6), reused 17 (delta 4), pack-reused 0
Receiving objects: 100% (19/19), 13.79 KiB | 336.00 KiB/s, done.
Resolving deltas: 100% (6/6), done.


In [6]:
!mkdir results

In [21]:
repo_id =  "openai/clip-vit-base-patch32" # base CLIP model; stable diffusion 1.5 itself uses the larger openai/clip-vit-large-patch14
#repo_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K" # this was the clip version for stable diffusion 2.0 onwards
# however, using it requires more memory than I have available. More than what's available free tier google colab too.
output_folder = "results"
batch_size = 100
num_train_epochs = 5
out_json = "train.json"

In [None]:
print(f"Finetuning {repo_id} for {num_train_epochs} epochs with batch size {batch_size}, and then saving output to {output_folder}.")
!python -W ignore finetune-clip-huggingface/huggingface_finetune_clip.py \
    --output_dir {output_folder} \
    --model_name_or_path {repo_id} \
    --train_file {out_json} \
    --image_column image \
    --overwrite_output_dir=True \
    --max_seq_length=77 \
    --num_train_epochs={num_train_epochs} \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train \
    --per_device_train_batch_size={batch_size} \
    --learning_rate="5e-5" --warmup_steps="2" --weight_decay 0.2
print("--\nDONE")
print(f"If it worked, trained data should be in {output_folder}")

Finetuning openai/clip-vit-base-patch32 for 5 epochs with batch size 100, and then saving output to results.
Filter: 100%|██████████████████████| 5220/5220 [00:00<00:00, 6050.89 examples/s]
Running tokenizer on train dataset: 100%|█| 5204/5204 [00:00<00:00, 12214.67 exa
  6%|██▍                                     | 16/265 [04:51<1:14:13, 17.89s/it]
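
The `265` in the progress bar can be reproduced from the numbers in the log. This is a back-of-the-envelope sketch, assuming a single device and no gradient accumulation (the log does not say either way); the example count is the 5204 reported by the tokenizer line, and the per-step time is the 17.89 s shown in the progress bar.

```python
import math

num_examples = 5204        # from "Running tokenizer on train dataset"
batch_size = 100           # per_device_train_batch_size
num_train_epochs = 5
seconds_per_step = 17.89   # from the progress bar

# The last, partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
estimated_minutes = round(total_steps * seconds_per_step / 60)

print(steps_per_epoch, total_steps, estimated_minutes)
# 53 265 79
```

The estimate of roughly 79 minutes agrees with the progress bar, which shows about 5 minutes elapsed and about 74 remaining.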

In [169]:
!zip -r archaeai.zip results/pytorch_model.bin results/config.json

updating: results/pytorch_model.bin (deflated 7%)
updating: results/config.json (deflated 46%)


In [170]:
from google.colab import files
files.download("archaeai.zip")


GPT4's 'insight' re the warning in the training script:

> That warning occurs because Python's `pickle` module is not able to serialize the function passed to `.map()`. In this case, the function is `transform_images`.

> Because the function `transform_images` is defined within the `main()` function, it technically includes references to all the local variables in `main()`, making it a closure. Closures are not serializable with `pickle`.

> A way around this would be to move the `transform_images` function out of the scope of the `main()` function. This means the function would be more of a "global" function rather than a local one, and it should remove the warning.

> Here's how you could modify it:

```python
def transform_images(examples, image_column, image_transformations):
    images = [read_image(image_file, mode=ImageReadMode.RGB) for image_file in examples[image_column]]
    examples["pixel_values"] = [image_transformations(image) for image in images]
    return examples
```

> And then within your main function, you can use it within the `.map()` function as:

```python
train_dataset.set_transform(lambda examples: transform_images(examples, image_column, image_transformations))
eval_dataset.set_transform(lambda examples: transform_images(examples, image_column, image_transformations))
```

>By passing the parameters explicitly and moving the function definition outside of the `main` function, it ensures that the function can be serialized and cached properly.
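
The general point about closures and serialization can be checked with a small sketch using the standard-library `pickle` module. This is only an illustration: the `datasets` library may use a different serializer for its caching, so the behaviour there could differ. The names `can_pickle` and `build_transform` are hypothetical.

```python
import math
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def build_transform():
    scale = 2

    # Defined inside another function and capturing 'scale': a closure.
    def transform(x):
        return x * scale

    return transform


closure = build_transform()

print(can_pickle(closure))      # nested function: not picklable
print(can_pickle(lambda x: x))  # lambdas are not picklable either
print(can_pickle(math.sqrt))    # importable by name: picklable
```

With standard `pickle`, a lambda fails in the same way as a nested function, so whether wrapping the module-level function in a lambda actually silences the warning is worth testing rather than assuming.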