This is the repository for the paper "Integrating Visual and Textual Inputs for Searching Large-Scale Map Collections with CLIP" [link forthcoming], accepted to the 2024 Computational Humanities Research (CHR) conference.
This paper explores the use of multimodal machine learning to facilitate search and discovery in large-scale map collections. We implement three search strategies using CLIP (Contrastive Language-Image Pre-training) embeddings on 562,842 images of maps from the Library of Congress:
- Text-input search
- Image-input search (reverse image search)
- Combined text and image input search
Our key contributions include:
- CLIP embeddings generated for 562,842 map images
- Search implementation allowing natural language, visual, and multimodal inputs
- Dataset of 10,504 map-caption pairs for fine-tuning CLIP
- Code released as Jupyter notebooks in the public domain
The paper demonstrates the potential for searching maps beyond catalog records and existing metadata. We consulted with Library of Congress staff to explore example searches and evaluate utility, and we followed the LC Labs AI Planning Framework to ensure responsible and ethical AI practices. While initial fine-tuning experiments yielded mixed results, we believe further work could reduce noise in searches. This research addresses the challenge of improving discoverability in rapidly growing digital collections, with implications for galleries, libraries, archives, and museums worldwide.
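As a rough illustration of the combined text-and-image strategy listed above, one common way to form a multimodal query is to average the normalized CLIP text and image embeddings. The sketch below assumes OpenAI's `clip` package and the ViT-B/32 checkpoint; it is an illustrative example, not the paper's exact implementation.

```python
# Hedged sketch: build a single query vector from a caption and an example image
# by averaging their normalized CLIP embeddings. The model choice (ViT-B/32) and
# the averaging scheme are illustrative assumptions.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def multimodal_query(text: str, image_path: str) -> torch.Tensor:
    """Return a unit-length query vector built from both a caption and an image."""
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize([text]).to(device))
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        image_emb = model.encode_image(image)
    # Normalize each modality, average, then re-normalize the result.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    query = (text_emb + image_emb) / 2
    return query / query.norm(dim=-1, keepdim=True)
```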
Helper files (those too large for GitHub) can be found at the project's public Zenodo repository. We recommend placing `beto`, `beto_idx`, and `beto_normalized` into the `search` folder for compatibility.
The `embeddings` folder contains scripts that accept resource URLs (in the form specified in `p1_map_file_list.csv`) and return CLIP-generated embeddings. `embed.stripped` has functionality to load a model checkpoint for fine-tuning experiments. `create_beto` accepts the JSON files generated by the `embed_*` scripts and creates `beto`, `beto_idx`, and `beto_normalized`.
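Conceptually, the embedding step amounts to fetching each map image by its resource URL, encoding it with CLIP, and writing the results to JSON. The sketch below is a hedged illustration, not the scripts' exact interface; the CSV column name (`url`) and the output file name are assumptions.

```python
# Hedged sketch of URL-to-embedding generation. The "url" column name and the
# embeddings.json output file are illustrative assumptions.
import json
from io import BytesIO

import clip
import pandas as pd
import requests
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_url(url: str) -> list:
    """Download one map image and return its CLIP embedding as a plain list."""
    image = Image.open(BytesIO(requests.get(url, timeout=30).content)).convert("RGB")
    with torch.no_grad():
        emb = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    return emb.squeeze(0).cpu().tolist()

urls = pd.read_csv("p1_map_file_list.csv")["url"]  # column name assumed
embeddings = {u: embed_url(u) for u in urls}

with open("embeddings.json", "w") as f:
    json.dump(embeddings, f)
```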
The `fine-tuning` folder contains a script for dataset creation, a notebook for incremental fine-tuning, and the fine-tuning script itself. Fine-tuning accepts a range of image-text pairs along with user-specified model hyperparameters.
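For orientation, a minimal contrastive fine-tuning loop over map-caption pairs might look like the sketch below. The notebook and script expose their own hyperparameters and checkpointing; the optimizer settings and batch construction here are illustrative assumptions.

```python
# Hedged sketch of CLIP contrastive fine-tuning. The DataLoader is assumed to yield
# (preprocessed_image_tensor, tokenized_caption) batches; the learning rate and
# optimizer are illustrative, not the repository's chosen hyperparameters.
import clip
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
loss_fn = torch.nn.CrossEntropyLoss()

def train_epoch(loader: DataLoader) -> None:
    """One pass over the map-caption pairs with CLIP's symmetric contrastive loss."""
    model.train()
    for images, texts in loader:
        images, texts = images.to(device), texts.to(device)
        logits_per_image, logits_per_text = model(images, texts)
        targets = torch.arange(len(images), device=device)
        loss = (loss_fn(logits_per_image, targets) + loss_fn(logits_per_text, targets)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```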
The `documentation` folder hosts the sheets documenting this project's compliance with the Library of Congress's AI Planning Framework, including assessments of the use case, data collection, and risk.
Lastly, the `search` notebook loads `beto` and `beto_idx` to accept user-specified search inputs. The first two cells must be run for the search cells to work. The third cell imports a fine-tuned model, which is not included in this repository.
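As a rough guide to what the text-input search cells do, the sketch below encodes a natural-language query with CLIP and ranks the precomputed map embeddings by cosine similarity. The file extensions and array formats assumed for `beto` and `beto_idx` are illustrative; consult the notebook for the exact loading code.

```python
# Hedged sketch of text-input search. Loading beto / beto_idx as NumPy arrays is an
# assumption about their on-disk format.
import clip
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

beto = np.load("search/beto.npy")                             # (n_maps, dim) embeddings -- assumed
beto_idx = np.load("search/beto_idx.npy", allow_pickle=True)  # row -> resource identifier -- assumed

def text_search(query: str, k: int = 10):
    """Return the k map identifiers whose embeddings best match the query text."""
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([query]).to(device)).cpu().numpy()[0]
    q = q / np.linalg.norm(q)
    emb = beto / np.linalg.norm(beto, axis=1, keepdims=True)
    top = np.argsort(emb @ q)[::-1][:k]
    return [beto_idx[i] for i in top]

print(text_search("panoramic map of a coastal city"))
```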
To use the search notebook out of the box:
- Clone this repository.
- Download `beto` and `beto_idx` from the Zenodo repository.
- Move `beto` and `beto_idx` to the `search` folder.
- Run the first two cells in `search` (ensure all imports are resolved).
- Run the cell under the corresponding search type.
For image-input search, the input image is not restricted to images from the Library of Congress; any local image can be used.
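A hedged sketch of image-input (reverse image) search with an arbitrary local image follows; as above, the `beto` and `beto_idx` file names and formats are assumptions.

```python
# Hedged sketch of reverse image search against the precomputed map embeddings.
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

beto = np.load("search/beto.npy")                             # assumed format
beto_idx = np.load("search/beto_idx.npy", allow_pickle=True)  # assumed format

def image_search(image_path: str, k: int = 10):
    """Return the k map identifiers most visually similar to the supplied image."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        q = model.encode_image(image).cpu().numpy()[0]
    q = q / np.linalg.norm(q)
    emb = beto / np.linalg.norm(beto, axis=1, keepdims=True)
    top = np.argsort(emb @ q)[::-1][:k]
    return [beto_idx[i] for i in top]

print(image_search("my_scanned_map.jpg"))  # any local image, not only LoC maps
```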